XLNet: Generalized Autoregressive Pretraining for Language Understanding

Overview

Introduction

XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.

For a detailed description of the method and experimental results, please refer to our paper:

XLNet: Generalized Autoregressive Pretraining for Language Understanding

​ Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

​ (*: equal contribution)

​ Preprint 2019

Release Notes

  • July 16, 2019: XLNet-Base.
  • June 19, 2019: initial release with XLNet-Large and code.

Results

As of June 19, 2019, XLNet outperforms BERT on 20 tasks and achieves state-of-the-art results on 18 tasks. Below are some comparisons between XLNet-Large and BERT-Large, which have similar model sizes:

Results on Reading Comprehension

Model       | RACE accuracy | SQuAD1.1 EM | SQuAD2.0 EM
BERT-Large  | 72.0          | 84.1        | 78.98
XLNet-Base  | -             | -           | 80.18
XLNet-Large | 81.75         | 88.95       | 86.12

We use SQuAD dev results in the table to exclude other factors such as using additional training data or other data augmentation techniques. See SQuAD leaderboard for test numbers.

Results on Text Classification

Model       | IMDB | Yelp-2 | Yelp-5 | DBpedia | Amazon-2 | Amazon-5
BERT-Large  | 4.51 | 1.89   | 29.32  | 0.64    | 2.63     | 34.17
XLNet-Large | 3.79 | 1.55   | 27.80  | 0.62    | 2.40     | 32.26

The above numbers are error rates.

Results on GLUE

Model       | MNLI | QNLI | QQP  | RTE  | SST-2 | MRPC | CoLA | STS-B
BERT-Large  | 86.6 | 92.3 | 91.3 | 70.4 | 93.2  | 88.0 | 60.6 | 90.0
XLNet-Base  | 86.8 | 91.7 | 91.4 | 74.0 | 94.7  | 88.2 | 60.2 | 89.5
XLNet-Large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6  | 89.2 | 63.6 | 91.8

We use single-task dev results in the table to exclude other factors such as multi-task learning or using ensembles.

Pre-trained models

Released Models

As of July 16, 2019, the following models have been made available:

  • XLNet-Large, Cased: 24-layer, 1024-hidden, 16-heads
  • XLNet-Base, Cased: 12-layer, 768-hidden, 12-heads. This model is trained on the full data (different from the one in the paper).

We only release cased models for now because on the tasks we consider, we found: (1) for the base setting, cased and uncased models have similar performance; (2) for the large setting, cased models are a bit better in some tasks.

Each .zip file contains three items:

  • A TensorFlow checkpoint (xlnet_model.ckpt) containing the pre-trained weights (which is actually 3 files).
  • A SentencePiece model (spiece.model) used for (de)tokenization.
  • A config file (xlnet_config.json) which specifies the hyperparameters of the model (a quick way to inspect it is sketched below).
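
As a quick sanity check after unpacking, the config file is plain JSON and can be inspected with the standard library. This is only an illustrative sketch; the directory name below is an assumption and should be replaced with wherever the .zip was extracted.

import json

# Path is an assumption -- point it at your unpacked model directory.
with open("xlnet_cased_L-24_H-1024_A-16/xlnet_config.json") as f:
    xlnet_config = json.load(f)

# Prints the stored hyperparameters (e.g. number of layers, hidden size, heads).
print(json.dumps(xlnet_config, indent=2))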

Future Release Plan

We also plan to continuously release more pretrained models under different settings, including:

  • A pretrained model that is finetuned on Wikipedia. This can be used for tasks with Wikipedia text such as SQuAD and HotpotQA.
  • Pretrained models with other hyperparameter configurations, targeting specific downstream tasks.
  • Pretrained models that benefit from new techniques.

Subscribing to XLNet on Google Groups

To receive notifications about updates, announcements and new releases, we recommend subscribing to the XLNet group on Google Groups.

Fine-tuning with XLNet

As of June 19, 2019, this code base has been tested with TensorFlow 1.13.1 under Python 2.

Memory Issue during Finetuning

  • Most of the SOTA results in our paper were produced on TPUs, which generally have more RAM than common GPUs. As a result, it is currently very difficult (costly) to reproduce most of the XLNet-Large SOTA results in the paper using GPUs with 12GB to 16GB of RAM, because a 16GB GPU can only hold a single sequence of length 512 for XLNet-Large. Therefore, a large number of GPUs (ranging from 32 to 128, equal to batch_size) is required to reproduce many results in the paper.
  • We are experimenting with gradient accumulation to potentially relieve the memory burden, which could be included in a near-future update (a minimal sketch of the idea is given after this list).
  • Alternative methods of finetuning XLNet on constrained hardware have been presented in renatoviolin's repo, which obtained 86.24 F1 on SQuAD2.0 with an 8GB GPU.
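
To make the gradient-accumulation idea concrete, below is a minimal sketch in TensorFlow 1.x. It is not part of this code base; loss, optimizer, and accum_steps are hypothetical placeholders, and integrating it into the actual training loop would require more care.

import tensorflow as tf

def build_accum_train_ops(loss, optimizer, accum_steps):
    """Accumulate gradients over `accum_steps` micro-batches before applying."""
    tvars = tf.trainable_variables()
    grads = [g if g is not None else tf.zeros_like(v)
             for g, v in zip(tf.gradients(loss, tvars), tvars)]
    # Non-trainable buffers holding the running sum of gradients.
    accum_vars = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
    zero_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accum_vars])
    accum_op = tf.group(*[a.assign_add(g) for a, g in zip(accum_vars, grads)])
    # Apply the averaged gradients once every `accum_steps` micro-batches.
    apply_op = optimizer.apply_gradients(
        [(a / accum_steps, v) for a, v in zip(accum_vars, tvars)])
    # In the session loop: run zero_op, then accum_op `accum_steps` times,
    # then apply_op; this emulates a batch `accum_steps` times larger.
    return zero_op, accum_op, apply_op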

Given the memory issue mentioned above, using the default finetuning scripts (run_classifier.py and run_squad.py), we benchmarked the maximum batch size on a single 16GB GPU with TensorFlow 1.13.1:

System      | Seq Length | Max Batch Size
XLNet-Base  | 64         | 120
...         | 128        | 56
...         | 256        | 24
...         | 512        | 8
XLNet-Large | 64         | 16
...         | 128        | 8
...         | 256        | 2
...         | 512        | 1

In most cases, it is possible to reduce the batch size (train_batch_size) or the maximum sequence length (max_seq_length) to fit the given hardware. The resulting decrease in performance depends on the task and the available resources.

Text Classification/Regression

The code used to perform classification/regression finetuning is in run_classifier.py. It also contains examples for standard one-document classification, one-document regression, and document pair classification. Here, we provide two concrete examples of how run_classifier.py can be used.

From here on, we assume XLNet-Large and XLNet-Base have been downloaded to $LARGE_DIR and $BASE_DIR, respectively.

(1) STS-B: sentence pair relevance regression (with GPUs)

  • Download the GLUE data by running this script and unpack it to some directory $GLUE_DIR.

  • Perform multi-GPU (4 V100 GPUs) finetuning with XLNet-Large by running

    CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
      --do_train=True \
      --do_eval=False \
      --task_name=sts-b \
      --data_dir=${GLUE_DIR}/STS-B \
      --output_dir=proc_data/sts-b \
      --model_dir=exp/sts-b \
      --uncased=False \
      --spiece_model_file=${LARGE_DIR}/spiece.model \
      --model_config_path=${LARGE_DIR}/xlnet_config.json \
      --init_checkpoint=${LARGE_DIR}/xlnet_model.ckpt \
      --max_seq_length=128 \
      --train_batch_size=8 \
      --num_hosts=1 \
      --num_core_per_host=4 \
      --learning_rate=5e-5 \
      --train_steps=1200 \
      --warmup_steps=120 \
      --save_steps=600 \
      --is_regression=True
  • Evaluate the finetuning results with a single GPU by

    CUDA_VISIBLE_DEVICES=0 python run_classifier.py \
      --do_train=False \
      --do_eval=True \
      --task_name=sts-b \
      --data_dir=${GLUE_DIR}/STS-B \
      --output_dir=proc_data/sts-b \
      --model_dir=exp/sts-b \
      --uncased=False \
      --spiece_model_file=${LARGE_DIR}/spiece.model \
      --model_config_path=${LARGE_DIR}/xlnet_config.json \
      --max_seq_length=128 \
      --eval_batch_size=8 \
      --num_hosts=1 \
      --num_core_per_host=1 \
      --eval_all_ckpt=True \
      --is_regression=True
    
    # Expected performance: "eval_pearsonr 0.916+ "

Notes:

  • In the context of GPU training, num_core_per_host denotes the number of GPUs to use.
  • In the multi-GPU setting, train_batch_size refers to the per-GPU batch size.
  • eval_all_ckpt allows one to evaluate all saved checkpoints (save frequency is controlled by save_steps) after training finishes and choose the best model based on dev performance.
  • data_dir and output_dir refer to the directories of the "raw data" and "preprocessed tfrecords" respectively, while model_dir is the working directory for saving checkpoints and tensorflow events. model_dir should be a separate folder from init_checkpoint.
  • To try out XLNet-Base, one can simply set --train_batch_size=32 and --num_core_per_host=1, along with corresponding changes to init_checkpoint and model_config_path.
  • For GPUs with smaller RAM, please proportionally decrease train_batch_size and increase num_core_per_host to keep the same effective training setting (see the sketch after these notes).
  • Important: we separate training and evaluation into "two phases", as multi-GPU evaluation is tricky (the data has to be correctly partitioned across GPUs). To ensure correctness, we only support single-GPU evaluation for now.
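
Since train_batch_size is per GPU, the global batch size equals train_batch_size × num_core_per_host. The snippet below only illustrates that bookkeeping; the helper name is hypothetical and not part of the code base.

def per_gpu_batch_size(global_batch_size, num_gpus):
    """Per-GPU batch size that preserves a target global batch size."""
    assert global_batch_size % num_gpus == 0
    return global_batch_size // num_gpus

# The STS-B example above uses 8 (per GPU) x 4 GPUs = 32 globally.
# With 8 GPUs, train_batch_size=4 keeps the same effective setting.
print(per_gpu_batch_size(32, 8))  # -> 4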

(2) IMDB: movie review sentiment classification (with TPU V3-8)

  • Download and unpack the IMDB dataset by running

    wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    tar zxvf aclImdb_v1.tar.gz
  • Launch a Google cloud TPU V3-8 instance (see the Google Cloud TPU tutorial for how to set up Cloud TPUs).

  • Set up your Google storage bucket path $GS_ROOT and move the IMDB dataset and pretrained checkpoint into your Google storage.

  • Perform TPU finetuning with XLNet-Large by running

    python run_classifier.py \
      --use_tpu=True \
      --tpu=${TPU_NAME} \
      --do_train=True \
      --do_eval=True \
      --eval_all_ckpt=True \
      --task_name=imdb \
      --data_dir=${IMDB_DIR} \
      --output_dir=${GS_ROOT}/proc_data/imdb \
      --model_dir=${GS_ROOT}/exp/imdb \
      --uncased=False \
      --spiece_model_file=${LARGE_DIR}/spiece.model \
      --model_config_path=${GS_ROOT}/${LARGE_DIR}/xlnet_config.json \
      --init_checkpoint=${GS_ROOT}/${LARGE_DIR}/xlnet_model.ckpt \
      --max_seq_length=512 \
      --train_batch_size=32 \
      --eval_batch_size=8 \
      --num_hosts=1 \
      --num_core_per_host=8 \
      --learning_rate=2e-5 \
      --train_steps=4000 \
      --warmup_steps=500 \
      --save_steps=500 \
      --iterations=500
    
    # Expected performance: "eval_accuracy 0.962+ "

Notes:

  • To obtain the SOTA on the IMDB dataset, using sequence length 512 is necessary. Therefore, we show how this can be done with a TPU V3-8.
  • Alternatively, one can use a sequence length smaller than 512, a smaller batch size, or switch to XLNet-Base to train on GPUs, but a performance drop is expected.
  • Notice that data_dir and spiece_model_file both use a local path rather than a Google Storage path. The reason is that data preprocessing is actually performed locally, so using local paths leads to faster preprocessing.

SQuAD2.0

The code for the SQuAD dataset is included in run_squad.py.

To run the code:

(1) Download the SQuAD2.0 dataset into $SQUAD_DIR by:

mkdir -p ${SQUAD_DIR} && cd ${SQUAD_DIR}
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

(2) Perform data preprocessing using the script scripts/prepro_squad.sh.

  • This will take quite some time in order to accurately map character positions (raw data) to sentence piece positions (used for training).

  • For faster parallel preprocessing, please refer to the flags --num_proc and --proc_id in run_squad.py (a minimal example of launching shards in parallel follows).
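
Purely as an illustration, the preprocessing shards could be launched in parallel as below; every other flag run_squad.py needs (data paths, SentencePiece model, output directory, etc.) is omitted here and should be filled in as in scripts/prepro_squad.sh.

import subprocess

NUM_PROC = 4  # number of parallel preprocessing shards (an assumption)

procs = [
    subprocess.Popen([
        "python", "run_squad.py",
        "--num_proc=%d" % NUM_PROC,
        "--proc_id=%d" % proc_id,
        # ... plus the usual preprocessing flags from scripts/prepro_squad.sh
    ])
    for proc_id in range(NUM_PROC)
]
for p in procs:
    p.wait()  # block until every shard has finished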

(3) Perform training and evaluation.

For the best performance, XLNet-Large uses sequence length 512 and batch size 48 for training.

  • As a result, reproducing the best result with GPUs is quite difficult.

  • For training with one TPU v3-8, one can simply run the script scripts/tpu_squad_large.sh after both the TPU and Google storage have been setup.

  • run_squad.py will automatically perform threshold searching on the SQuAD dev set and output the score (a simplified sketch of this search is given after the next paragraph). With scripts/tpu_squad_large.sh, the expected F1 score should be around 88.6 (median of our multiple runs).

Alternatively, one can use XLNet-Base with GPUs (e.g., three V100s). One set of reasonable hyper-parameters can be found in the script scripts/gpu_squad_base.sh.
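
To clarify the threshold search mentioned above: for each question, suppose we have a no-answer ("null") score plus the score and F1 of the best non-null span; a single global threshold on their difference then decides whether to predict "no answer". The function below is a simplified, hypothetical sketch of that search; run_squad.py implements it more carefully.

def search_null_threshold(examples):
    """examples: list of (null_score, span_score, span_f1, has_answer)."""
    best_f1, best_thresh = -1.0, 0.0
    for thresh in sorted(null - span for null, span, _, _ in examples):
        total = 0.0
        for null_score, span_score, span_f1, has_answer in examples:
            if null_score - span_score > thresh:   # predict "no answer"
                total += 0.0 if has_answer else 1.0
            else:                                  # predict the best span
                total += span_f1 if has_answer else 0.0
        f1 = total / len(examples)
        if f1 > best_f1:
            best_f1, best_thresh = f1, thresh
    return best_thresh, best_f1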

RACE reading comprehension

The code for the reading comprehension task RACE is included in run_race.py.

  • Notably, the average length of the passages in RACE is over 300 tokens (not pieces), which is significantly longer than other popular reading comprehension datasets such as SQuAD.
  • Also, many questions can be very difficult and require complex reasoning for machines to solve (see one example here).

To run the code:

(1) Download the RACE dataset from the official website and unpack the raw data to $RACE_DIR.

(2) Perform training and evaluation:

  • The SOTA performance (accuracy 81.75) on RACE is produced using XLNet-Large with sequence length 512 and batch size 32, which requires a large TPU v3-32 in the pod setting. Please refer to the script scripts/tpu_race_large_bsz32.sh for this setting.
  • Using XLNet-Large with sequence length 512 and batch size 8 on a TPU v3-8 can give you an accuracy of around 80.3 (see scripts/tpu_race_large_bsz8.sh).

Using Google Colab

An example of using Google Colab with GPUs has been provided. Note that since the hardware is constrained in the example, the results are worse than the best we can get. It mainly serves as an example and should be modified accordingly to maximize performance.

Custom Usage of XLNet

XLNet Abstraction

For finetuning, it is likely that you will be able to modify existing files such as run_classifier.py, run_squad.py and run_race.py for your task at hand. However, we also provide an abstraction of XLNet to enable more flexible usage. Below is an example:

import xlnet

# some code omitted here...
# initialize FLAGS
# initialize instances of tf.Tensor, including input_ids, seg_ids, and input_mask

# XLNetConfig contains hyperparameters that are specific to a model checkpoint.
xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path)

# RunConfig contains hyperparameters that could be different between pretraining and finetuning.
run_config = xlnet.create_run_config(is_training=True, is_finetune=True, FLAGS=FLAGS)

# Construct an XLNet model
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=seg_ids,
    input_mask=input_mask)

# Get a summary of the sequence using the last hidden state
summary = xlnet_model.get_pooled_out(summary_type="last")

# Get a sequence output
seq_out = xlnet_model.get_sequence_output()

# build your applications based on `summary` or `seq_out`
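
For example, a classification head could be stacked on top of summary roughly as follows. This is only a sketch under TF 1.x; n_class and labels are hypothetical placeholders that are not defined by xlnet.py.

import tensorflow as tf

# `n_class` and `labels` are hypothetical placeholders for your task.
logits = tf.layers.dense(summary, units=n_class, name="task_logits")
per_example_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)
total_loss = tf.reduce_mean(per_example_loss)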

Tokenization

Below is an example of doing tokenization in XLNet:

import sentencepiece as spm
from prepro_utils import preprocess_text, encode_ids

# some code omitted here...
# initialize FLAGS

text = "An input text string."

sp_model = spm.SentencePieceProcessor()
sp_model.Load(FLAGS.spiece_model_file)
text = preprocess_text(text, lower=FLAGS.uncased)
ids = encode_ids(sp_model, text)

where FLAGS.spiece_model_file is the SentencePiece model file in the same zip as the pretrained model, and FLAGS.uncased is a bool indicating whether to do uncasing.
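
As a quick check that tokenization behaves as expected, the ids can be mapped back to SentencePiece pieces and text using the standard sentencepiece Python API. This is illustrative only:

# Inspect the pieces behind the ids and round-trip back to text.
pieces = [sp_model.IdToPiece(i) for i in ids]
print(pieces)
print(sp_model.DecodeIds(ids))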

Pretraining with XLNet

Refer to train.py for pretraining on TPUs and train_gpu.py for pretraining on GPUs. First, we need to preprocess the text data into tfrecords:

python data_utils.py \
	--bsz_per_host=32 \
	--num_core_per_host=16 \
	--seq_len=512 \
	--reuse_len=256 \
	--input_glob=*.txt \
	--save_dir=${SAVE_DIR} \
	--num_passes=20 \
	--bi_data=True \
	--sp_path=spiece.model \
	--mask_alpha=6 \
	--mask_beta=1 \
	--num_predict=85

where input_glob defines all input text files, save_dir is the output directory for tfrecords, and sp_path is a SentencePiece model. Here is the script we use to train the SentencePiece model:

spm_train \
	--input=$INPUT \
	--model_prefix=sp10m.cased.v3 \
	--vocab_size=32000 \
	--character_coverage=0.99995 \
	--model_type=unigram \
	--control_symbols=<cls>,<sep>,<pad>,<mask>,<eod> \
	--user_defined_symbols=<eop>,.,(,),",-,–,£,€ \
	--shuffle_input_sentence \
	--input_sentence_size=10000000

Special symbols are used, including control_symbols and user_defined_symbols. We use <eop> and <eod> to denote End of Paragraph and End of Document, respectively.

The input text files to data_utils.py must use the following format:

  • Each line is a sentence.
  • An empty line means End of Document.
  • (Optional) If one also wants to model paragraph structures, <eop> can be inserted at the end of certain lines (without any space) to indicate that the corresponding sentence ends a paragraph.

For example, the text input file could be:

This is the first sentence.
This is the second sentence and also the end of the paragraph.<eop>
Another paragraph.<eop>

Another document starts here.

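Purely as an illustration of this format (not code from this repo), a parser for such files could look like the sketch below; the function name is hypothetical.

def read_documents(path):
    """Parse the pretraining input format: one sentence per line, an empty
    line ends a document, and an optional <eop> suffix ends a paragraph."""
    docs, doc = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():              # empty line: end of document
                if doc:
                    docs.append(doc)
                    doc = []
                continue
            end_of_para = line.endswith("<eop>")
            sentence = line[:-len("<eop>")] if end_of_para else line
            doc.append((sentence, end_of_para))
    if doc:   # last document may lack a trailing blank line
        docs.append(doc)
    return docs
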
After preprocessing, we are ready to pretrain an XLNet. Below are the hyperparameters used for pretraining XLNet-Large:

python train.py \
  --record_info_dir=$DATA/tfrecords \
  --train_batch_size=2048 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=1024 \
  --d_embed=1024 \
  --n_head=16 \
  --d_head=64 \
  --d_inner=4096 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85

where we only list the most important flags; the other flags can be adjusted based on specific use cases.
