Big Bird: Transformers for Longer Sequences

Overview

Not an official Google product.

What is BigBird?

BigBird is a sparse-attention-based transformer that extends Transformer-based models, such as BERT, to much longer sequences. The work also provides a theoretical analysis of which capabilities of a full transformer the sparse model can handle.

As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization.

More details and comparisons can be found in our presentation.

Citation

If you find this useful, please cite our NeurIPS 2020 paper:

@article{zaheer2020bigbird,
  title={Big bird: Transformers for longer sequences},
  author={Zaheer, Manzil and Guruganesh, Guru and Dubey, Kumar Avinava and Ainslie, Joshua and Alberti, Chris and Ontanon, Santiago and Pham, Philip and Ravula, Anirudh and Wang, Qifan and Yang, Li and others},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}

Code

The most important directory is core. There are three main files in core.

  • attention.py: Contains BigBird linear attention mechanism
  • encoder.py: Contains the main long sequence encoder stack
  • modeling.py: Contains packaged BERT and seq2seq transformer models with BigBird attention

Colab/IPython Notebook

A quick fine-tuning demonstration for text classification is provided in imdb.ipynb

Create GCP Instance

First create a project, then create an instance in a zone where you have quota, as follows:

gcloud compute instances create \
  bigbird \
  --zone=europe-west4-a \
  --machine-type=n1-standard-16 \
  --boot-disk-size=50GB \
  --image-project=ml-images \
  --image-family=tf-2-3-1 \
  --maintenance-policy TERMINATE \
  --restart-on-failure \
  --scopes=cloud-platform

gcloud compute tpus create \
  bigbird \
  --zone=europe-west4-a \
  --accelerator-type=v3-32 \
  --version=2.3.1

gcloud compute ssh --zone "europe-west4-a" "bigbird"

For illustration we used the instance name bigbird and the zone europe-west4-a, but feel free to change them. More details about creating a Google Cloud TPU can be found in the online documentation.

Installation and checkpoints

git clone https://github.com/google-research/bigbird.git
cd bigbird
pip3 install -e .

You can find pretrained and fine-tuned checkpoints in our Google Cloud Storage Bucket.

Optionally, you can download them using gsutil:

mkdir -p bigbird/ckpt
gsutil cp -r gs://bigbird-transformer/ bigbird/ckpt/

The storage bucket contains:

  • Pretrained BERT models in base (bigbr_base) and large (bigbr_large) sizes. These correspond to BERT/RoBERTa-like encoder-only models. Following the original BERT and RoBERTa implementations, they are transformers with post-normalization, i.e. layer norm is applied after the attention layer. However, following Rothe et al., we can partially use them in encoder-decoder fashion by coupling the encoder and decoder parameters, as illustrated in the bigbird/summarization/roberta_base.sh launch script.
  • Pretrained Pegasus encoder-decoder transformer in large size (bigbp_large). Again following the original Pegasus implementation, these are transformers with pre-normalization, and they have a full set of separate encoder and decoder weights. For long-document summarization datasets, we have also converted Pegasus checkpoints (model.ckpt-0) for each dataset and provided fine-tuned checkpoints (model.ckpt-300000) that work on longer documents.
  • Fine-tuned tf.SavedModel for long-document summarization, which can be used directly for prediction and evaluation, as illustrated in the colab notebook.
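The difference between the post-normalization and pre-normalization layouts mentioned above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from this repository: layer_norm omits the learned scale and bias, and toy_sublayer is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    """Post-normalization (BERT/RoBERTa layout): LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_norm_block(x, sublayer):
    """Pre-normalization (Pegasus layout): x + sublayer(LayerNorm(x))."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def toy_sublayer(x):
    """Hypothetical stand-in for attention or a feed-forward layer."""
    return [2.0 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_norm_block(x, toy_sublayer)  # output is normalized
pre_out = pre_norm_block(x, toy_sublayer)    # residual path is left unnormalized
```

In the post-norm layout the block output itself is normalized, while in the pre-norm layout the residual stream passes through unchanged. The two kinds of checkpoints therefore follow different layer layouts.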

Running Classification

To get started quickly with BigBird, run the classification experiment code in the classifier directory. To run the code, simply execute

export GCP_PROJECT_NAME=bigbird-project  # Replace by your project name
export GCP_EXP_BUCKET=gs://bigbird-transformer-training/  # Replace
sh -x bigbird/classifier/base_size.sh

Using BigBird Encoder instead of BERT/RoBERTa

To use the BigBird encoder directly in place of, say, a BERT model, we can use the following code.

from bigbird.core import modeling

bigb_encoder = modeling.BertModel(...)

It can easily replace BERT's encoder.

Alternatively, one can experiment with the individual layers of the BigBird encoder

from bigbird.core import encoder

only_layers = encoder.EncoderStack(...)

Understanding Flags & Config

All the flags and config are explained in core/flags.py. Here we explain some of the important config parameters.

attention_type is used to select the type of attention we would use. Setting it to block_sparse runs the BigBird attention module.

flags.DEFINE_enum(
    "attention_type", "block_sparse",
    ["original_full", "simulated_sparse", "block_sparse"],
    "Selecting attention implementation. "
    "'original_full': full attention from original bert. "
    "'simulated_sparse': simulated sparse attention. "
    "'block_sparse': blocked implementation of sparse attention.")

block_size defines the size of the blocks, whereas num_rand_blocks sets the number of random blocks. The code currently uses a window size of 3 blocks and 2 global blocks. The current code only supports static tensors.
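The block pattern described above can be sketched at block level as follows. This is an illustrative sketch under stated assumptions, not the implementation in core/attention.py: the function name block_sparse_pattern is hypothetical, the first and last blocks are assumed to be the 2 global blocks, and the sequence length is assumed to be a multiple of block_size.

```python
import random


def block_sparse_pattern(seq_len, block_size, num_rand_blocks, seed=0):
    """Return, for each query block, the set of key blocks it attends to.

    Assumptions of this sketch: the first and last blocks are global,
    the window covers the block itself and its two neighbours (3 blocks),
    and random blocks are sampled without replacement from the rest.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be a multiple of block_size in this sketch")
    n = seq_len // block_size
    rng = random.Random(seed)
    global_blocks = {0, n - 1}

    pattern = []
    for i in range(n):
        if i in global_blocks:
            # Global blocks attend to every block.
            pattern.append(set(range(n)))
            continue
        keys = set(global_blocks)                                  # global blocks
        keys.update(j for j in (i - 1, i, i + 1) if 0 <= j < n)    # sliding window
        candidates = [j for j in range(n) if j not in keys]
        keys.update(rng.sample(candidates, min(num_rand_blocks, len(candidates))))
        pattern.append(keys)
    return pattern


pattern = block_sparse_pattern(seq_len=4096, block_size=64, num_rand_blocks=3)
sparse_pairs = sum(len(keys) for keys in pattern)
full_pairs = len(pattern) ** 2
print(f"block pairs attended: {sparse_pairs} of {full_pairs}")
```

Each non-global query block attends to at most 2 + 3 + num_rand_blocks key blocks regardless of sequence length, so the number of attended block pairs grows linearly with the number of blocks instead of quadratically.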

Important points to note:

  • Hidden dimension should be divisible by the number of heads.
  • Currently the code only handles tensors of static shape, as it is primarily designed for TPUs, which only work with statically shaped tensors.
  • For sequence lengths less than 1024, using original_full is advised, as there is no benefit in using sparse BigBird attention.
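The first and last points can be turned into a small pre-flight check. This is a hypothetical helper written for illustration; choose_attention_type and its argument names are not part of the BigBird code base, and only the attention_type values come from the flag definition above.

```python
def choose_attention_type(seq_length, hidden_size, num_heads):
    """Validate the head split and suggest an attention_type value.

    Hypothetical helper: applies the README's guidance that sequences
    shorter than 1024 tokens should use full attention.
    """
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size ({hidden_size}) must be divisible by num_heads ({num_heads})"
        )
    return "original_full" if seq_length < 1024 else "block_sparse"


short_choice = choose_attention_type(seq_length=512, hidden_size=768, num_heads=12)
long_choice = choose_attention_type(seq_length=4096, hidden_size=768, num_heads=12)
```

The returned string can then be supplied as the attention_type setting.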