Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"

Last update: Jul 28, 2022

VoCapXLM

Environment

Docker image: dancingsoul/pytorch:VoCapXLM

Manually build sentencepiece with the following commands:

cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

Data Preparation

Create a folder with mkdir -p monolingual_text in the root of this project.
Sample a monolingual corpus for each language individually and move the files to the monolingual_text directory, naming each after its language code (e.g., en.txt).
Sample the multilingual corpus from monolingual corpora with the following command:

python sample_multilingual_corpus.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_text/ \
    --output_path ./multilingual_corpus.text \
    --n_sample <n_sample> --beta <beta> --rescale

where the options are described as follows:

--lang_prob_path: the probability of sampling training instances from each language during pre-training. lang_prob_wiki.json is computed from the Wikipedia corpus, with the probabilities rescaled using alpha=0.7 from Equation (3) in our paper.
--n_sample: the number of sentences in the multilingual corpus on which the final multilingual sentencepiece model is trained. The default value is 20000000.
--rescale: further rescale the probabilities with another value, beta, from Equation (2) in our paper.
--beta: the rescaling factor in Equation (2). The default value is 0.7.
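
The effect of --rescale and --beta can be illustrated with a minimal sketch. It assumes the rescaling raises each language probability to the power beta and renormalises, then splits n_sample across languages in proportion to the result. The function names and toy probabilities below are hypothetical and are not taken from sample_multilingual_corpus.py.

```python
def rescale(lang_prob, beta):
    """Raise each probability to the power beta and renormalise to sum to 1."""
    powered = {lang: p ** beta for lang, p in lang_prob.items()}
    total = sum(powered.values())
    return {lang: v / total for lang, v in powered.items()}


def sentences_per_language(lang_prob, n_sample):
    """Split n_sample sentences across languages in proportion to lang_prob."""
    return {lang: int(round(p * n_sample)) for lang, p in lang_prob.items()}


# Toy probabilities (hypothetical, not the values in lang_prob_wiki.json).
lang_prob = {"en": 0.7, "fr": 0.2, "sw": 0.1}
rescaled = rescale(lang_prob, beta=0.7)
counts = sentences_per_language(rescaled, n_sample=1000)
print(rescaled)
print(counts)
```

With beta below 1, high-resource languages lose probability mass and low-resource languages gain it, which is why the smallest language in the toy example ends up with more than 10% of the samples.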

Training Monolingual SentencePiece Models

Train monolingual sentencepiece models of different sizes to obtain vocabularies with different ALP (average log probability), i.e., language-specific vocabulary capacity.

python train_mono_spm.py \
    --input_dir ./monolingual_text/ \
    --output_dir ./monolingual_spm/ \
    --languages <all_languages> \
    --min_vocab_size <min_vocab_size> \
    --max_vocab_size <max_vocab_size> \
    --delta_vocab_size <delta_vocab_size> \
    --n_sample <n_sample>

where the options are described as follows:

--languages: all languages under the monolingual_text directory, separated by commas, e.g., en,fr,zh.
--min_vocab_size: the minimum vocabulary size allocated for each language. The default value is 1000.
--max_vocab_size: the maximum vocabulary size allocated for each language. The default value is 50000.
--delta_vocab_size: the step between successive vocabulary sizes to learn. The default value is 1000.
--n_sample: the number of sentences used to calculate ALP for each language. The default value is 1000000.

Alternatively, you can download our pre-trained monolingual sentencepiece models and vocabularies from [here][2].
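
To make the size options and ALP concrete, here is a rough sketch. It assumes the grid of vocabulary sizes includes the maximum, and that ALP is the per-sentence average of the summed log-probabilities of a sentence's subwords under a unigram language model. The helper names and the toy unigram model are hypothetical and do not come from train_mono_spm.py.

```python
import math


def vocab_size_grid(min_vocab_size=1000, max_vocab_size=50000, delta_vocab_size=1000):
    """Vocabulary sizes from min to max (assumed inclusive) in steps of delta."""
    return list(range(min_vocab_size, max_vocab_size + 1, delta_vocab_size))


def average_log_probability(tokenized_sentences, unigram_prob):
    """Mean over sentences of the summed log unigram probability of their subwords."""
    total = 0.0
    for pieces in tokenized_sentences:
        total += sum(math.log(unigram_prob[p]) for p in pieces)
    return total / len(tokenized_sentences)


sizes = vocab_size_grid()
print(len(sizes), sizes[0], sizes[-1])

# Toy example: a hypothetical three-piece unigram model and two tokenized sentences.
unigram_prob = {"▁he": 0.5, "llo": 0.25, "▁world": 0.25}
sentences = [["▁he", "llo"], ["▁world"]]
alp = average_log_probability(sentences, unigram_prob)
print(alp)
```

With the default options the grid contains 50 sizes per language, so one sentencepiece model is trained per language per size.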

Allocating Multilingual Vocabulary

Allocate the multilingual vocabulary from monolingual vocabularies:

python train_vocap.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_spm/ \
    --output_path ./multilingual.vocab \
    --beta <beta> --rescale --target_vocab_size <target_vocab_size>

where the options are described as follows:

--lang_prob_path: same as the above.
--rescale: same as the above.
--beta: same as the above.
--target_vocab_size: the desired vocabulary size of the multilingual vocabulary, the default value is 500000.
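
One way to picture the allocation step is a greedy loop that repeatedly gives the next block of vocabulary to the language with the largest weighted ALP gain. The sketch below is a simplified illustration under that assumption, not the logic of train_vocap.py: it ignores subwords shared between languages (sizes are simply summed), and the function name, arguments, and toy ALP tables are hypothetical.

```python
def allocate_vocab(alp, weights, target_vocab_size, delta=1000):
    """Greedily grow per-language vocabulary sizes.

    alp[lang][size] is the ALP of lang with a monolingual vocabulary of that size;
    weights[lang] is the (rescaled) language probability.
    Overlap between languages is ignored, so sizes are simply summed.
    """
    sizes = {lang: min(table) for lang, table in alp.items()}
    total = sum(sizes.values())
    while total + delta <= target_vocab_size:
        best_lang, best_gain = None, float("-inf")
        for lang, size in sizes.items():
            nxt = size + delta
            if nxt not in alp[lang]:
                continue  # already at the largest trained size
            gain = weights[lang] * (alp[lang][nxt] - alp[lang][size])
            if gain > best_gain:
                best_lang, best_gain = lang, gain
        if best_lang is None:
            break  # no language can grow any further
        sizes[best_lang] += delta
        total += delta
    return sizes


# Toy ALP tables (hypothetical values): language "a" benefits more from extra vocabulary.
alp = {
    "a": {1000: -10.0, 2000: -5.0, 3000: -4.0},
    "b": {1000: -10.0, 2000: -9.5, 3000: -9.2},
}
weights = {"a": 0.5, "b": 0.5}
allocation = allocate_vocab(alp, weights, target_vocab_size=4000)
print(allocation)
```

In the toy example both extra blocks go to language "a", because its ALP improves more per added block than that of language "b".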

Then use sentencepiece to train the tokenizer given the multilingual vocabulary:

spm_train --input=./multilingual_corpus.text --model_prefix=<model_name> --vocab_size=<target_vocab_size> \
--character_coverage=0.9995 --model_type=unigram --shuffle_input_sentence=true \
--input_sentence_size=<input_sentence_size> --vocab_path=./multilingual.vocab

where the options are described as follows:

--model_prefix: the output model name prefix. <model_name>.model and <model_name>.vocab are generated.
--character_coverage: the fraction of characters covered by the model.
--vocab_size: same as --target_vocab_size.
--vocab_path: the path to the vocabulary whose subwords are required in the final learned tokenizer.
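
After training, it can be useful to check that the required subwords made it into the generated <model_name>.vocab. The sketch below assumes both files list one subword per line, optionally followed by a tab and a score (the layout sentencepiece uses for its .vocab output). The format of multilingual.vocab is an assumption here, and the helper names are hypothetical.

```python
import os
import tempfile


def load_pieces(vocab_file):
    """Read subwords from a file with one piece per line, optionally tab-separated from a score."""
    pieces = set()
    with open(vocab_file, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                pieces.add(line.split("\t")[0])
    return pieces


def missing_pieces(required_vocab_file, trained_vocab_file):
    """Return required subwords that are absent from the trained vocabulary."""
    return load_pieces(required_vocab_file) - load_pieces(trained_vocab_file)


# Toy demonstration with temporary files standing in for the real vocabularies.
with tempfile.TemporaryDirectory() as tmp:
    required = os.path.join(tmp, "multilingual.vocab")
    trained = os.path.join(tmp, "model.vocab")
    with open(required, "w", encoding="utf-8") as f:
        f.write("▁hello\t-1.0\n▁world\t-2.0\n")
    with open(trained, "w", encoding="utf-8") as f:
        f.write("<unk>\t0\n▁hello\t-1.5\n")
    missing = missing_pieces(required, trained)
print(sorted(missing))
```

An empty result from missing_pieces on the real files would indicate that every required subword is present in the trained tokenizer's vocabulary.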

Paper

Please cite our paper \cite{bo2021vocapxlm} if you find the resources in this repository useful.

@inproceedings{bo2021vocapxlm,
author = {Bo Zheng and Li Dong and Shaohan Huang and Saksham Singhal and Wanxiang Che and Ting Liu and Xia Song and Furu Wei},
booktitle = {Proceedings of EMNLP 2021},
title = {{Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training}},
year = {2021}
}

Owner

Bo Zheng