Code for EMNLP2021 paper "Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training"

VoCapXLM

Environment

Docker image: dancingsoul/pytorch:VoCapXLM

Manually build sentencepiece with the following commands:

cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v
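
After the build, it can help to confirm that the spm_train binary used later in this README is reachable. The following is a minimal sketch using only the Python standard library; the helper name find_spm_train is hypothetical and not part of this repository.

```python
import shutil


def find_spm_train():
    """Return the full path of the spm_train executable, or None if it is not on PATH."""
    return shutil.which("spm_train")


path = find_spm_train()
if path is None:
    print("spm_train not found on PATH; re-check the build and install steps")
else:
    print("spm_train found at", path)
```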

Data Preparation

  1. Create a folder with mkdir -p monolingual_text in the root of this project.
  2. Sample a monolingual corpus for each language individually and move the resulting files to the monolingual_text directory, naming each file after its language code (e.g., en.txt).
  3. Sample the multilingual corpus from monolingual corpora with the following command:
python sample_multilingual_corpus.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_text/ \
    --output_path ./multilingual_corpus.text \
    --n_sample <n_sample> --beta <beta> --rescale

where the options are described as follows:

  • --lang_prob_path: path to a JSON file with the probability of sampling training instances from each language during pre-training. The provided lang_prob_wiki.json is estimated from the Wikipedia corpus, with the probabilities rescaled using alpha=0.7 in Equation (3) of our paper.
  • --n_sample: number of sentences in the multilingual corpus on which the final multilingual sentencepiece model is trained. The default value is 20000000.
  • --rescale: further rescale the probabilities with the factor beta from Equation (2) of our paper.
  • --beta: the rescaling factor in Equation (2). The default value is 0.7.
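
As an illustration of what rescaling and sampling could look like, here is a minimal sketch. It assumes the common exponent-smoothing form (raise each probability to the power beta, then renormalize); the exact form of Equation (2) should be taken from the paper, and sample_multilingual_corpus.py may differ in detail. The toy probabilities and the function names rescale and sample_counts are hypothetical.

```python
import random


def rescale(lang_prob, beta):
    """Raise each probability to the power beta and renormalize (assumed form)."""
    powered = {lang: p ** beta for lang, p in lang_prob.items()}
    total = sum(powered.values())
    return {lang: value / total for lang, value in powered.items()}


def sample_counts(lang_prob, n_sample, seed=0):
    """Draw n_sample language labels and count the sentences assigned to each language."""
    rng = random.Random(seed)
    langs = list(lang_prob)
    weights = [lang_prob[lang] for lang in langs]
    counts = dict.fromkeys(langs, 0)
    for lang in rng.choices(langs, weights=weights, k=n_sample):
        counts[lang] += 1
    return counts


# Hypothetical values, not taken from lang_prob_wiki.json.
toy_prob = {"en": 0.7, "fr": 0.2, "sw": 0.1}
rescaled = rescale(toy_prob, beta=0.7)
counts = sample_counts(rescaled, n_sample=1000)
print(rescaled)
print(counts)
```

With beta below 1, low-resource languages receive a larger share of the sampled sentences than their raw probability would give them.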

Training Monolingual SentencePiece Models

Train monolingual sentencepiece models of different sizes to obtain vocabularies with different ALP (average log probability), i.e., language-specific vocabulary capacity.

python train_mono_spm.py \
    --input_dir ./monolingual_text/ \
    --output_dir ./monolingual_spm/ \
    --languages <all_languages> \
    --min_vocab_size <min_vocab_size> \
    --max_vocab_size <max_vocab_size> \
    --delta_vocab_size <delta_vocab_size> \
    --n_sample <n_sample>

where the options are described as follows:

  • --languages: all languages under the monolingual_text directory, separated by commas, e.g., en,fr,zh.
  • --min_vocab_size: minimum vocabulary size allocated for each language. The default value is 1000.
  • --max_vocab_size: maximum vocabulary size allocated for each language. The default value is 50000.
  • --delta_vocab_size: the step between successive vocabulary sizes to learn. The default value is 1000.
  • --n_sample: the number of sentences used to calculate ALP for each language. The default value is 1000000.

Alternatively, you can download our pre-trained monolingual sentencepiece models and vocabularies from [2].
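
To make the size grid and the ALP measure concrete, here is a minimal sketch. It assumes ALP is the mean, over sentences, of the summed unigram log-probabilities of each sentence's subwords; the paper gives the authoritative definition. The function names and the toy log-probabilities are hypothetical and are not taken from train_mono_spm.py.

```python
import math


def vocab_size_grid(min_size=1000, max_size=50000, delta=1000):
    """Vocabulary sizes from min_size to max_size (inclusive) in steps of delta."""
    return list(range(min_size, max_size + 1, delta))


def average_log_probability(tokenized_sentences, piece_logprob):
    """Mean over sentences of the summed unigram log-probability of their pieces."""
    total = 0.0
    for pieces in tokenized_sentences:
        total += sum(piece_logprob[piece] for piece in pieces)
    return total / len(tokenized_sentences)


grid = vocab_size_grid()

# Toy unigram model and two already-tokenized sentences (hypothetical values).
toy_logprob = {
    "▁he": math.log(0.5),
    "llo": math.log(0.25),
    "▁hello": math.log(0.25),
}
toy_sentences = [["▁he", "llo"], ["▁hello"]]
alp = average_log_probability(toy_sentences, toy_logprob)
print(len(grid), grid[0], grid[-1], alp)
```

Under this reading, a larger monolingual vocabulary tends to segment text into fewer, more probable pieces, which raises ALP for that language.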

Allocating Multilingual Vocabulary

Allocate the multilingual vocabulary from monolingual vocabularies:

python train_vocap.py \
    --lang_prob_path ./lang_prob_wiki.json \
    --input_dir ./monolingual_spm/ \
    --output_path ./multilingual.vocab \
    --beta <beta> --rescale --target_vocab_size <target_vocab_size>

where the options are described as follows:

  • --lang_prob_path: same as the above.
  • --rescale: same as the above.
  • --beta: same as the above.
  • --target_vocab_size: the desired vocabulary size of the multilingual vocabulary, the default value is 500000.
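
The allocation step can be pictured as a greedy search over per-language ALP curves. The sketch below is a simplified illustration, not the logic of train_vocap.py: it assumes each language has ALP values measured at increasing vocabulary sizes, weights each language's ALP gain by a given factor, and approximates the multilingual vocabulary size by the sum of the per-language sizes (ignoring subwords shared across languages). All names and numbers are hypothetical.

```python
def greedy_allocate(alp_curves, weights, target_size):
    """Greedily grow per-language vocabularies until the summed size reaches target_size.

    alp_curves: {lang: [(vocab_size, alp), ...]} sorted by increasing vocab_size.
    weights:    {lang: weight applied to that language's ALP gain}.
    Returns {lang: chosen vocab_size}.
    """
    step = {lang: 0 for lang in alp_curves}

    def total_size():
        return sum(alp_curves[lang][step[lang]][0] for lang in alp_curves)

    while total_size() < target_size:
        best_lang, best_gain = None, float("-inf")
        for lang, curve in alp_curves.items():
            i = step[lang]
            if i + 1 >= len(curve):
                continue  # this language is already at its largest vocabulary
            gain = weights[lang] * (curve[i + 1][1] - curve[i][1])
            if gain > best_gain:
                best_lang, best_gain = lang, gain
        if best_lang is None:
            break  # no language can grow further
        step[best_lang] += 1

    return {lang: alp_curves[lang][step[lang]][0] for lang in alp_curves}


# Toy ALP curves (hypothetical values).
toy_curves = {
    "en": [(1000, -200.0), (2000, -180.0), (3000, -175.0)],
    "sw": [(1000, -220.0), (2000, -190.0), (3000, -185.0)],
}
toy_weights = {"en": 0.7, "sw": 0.3}
allocation = greedy_allocate(toy_curves, toy_weights, target_size=4000)
print(allocation)
```

In this toy run the first step goes to en (weighted gain 14 versus 9), and the second goes to sw because the remaining en gain has dropped to 3.5.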

Then use sentencepiece to train the tokenizer given the multilingual vocabulary:

spm_train --input=./multilingual_corpus.text --model_prefix=<model_name> --vocab_size=<target_vocab_size> \
--character_coverage=0.9995 --model_type=unigram --shuffle_input_sentence=true \
--input_sentence_size=<input_sentence_size> --vocab_path=./multilingual.vocab

where the options are described as follows:

  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --character_coverage: fraction of characters covered by the model.
  • --vocab_size: same as --target_vocab_size.
  • --vocab_path: path to the vocabulary file listing the subwords required in the final learned tokenizer.
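
After training, one simple sanity check is to confirm that every required subword ended up in the learned vocabulary. The sketch below assumes both files hold one entry per line with the subword in the first tab-separated column, which is the layout of the .vocab file that spm_train writes; the layout of multilingual.vocab is an assumption here. The helper names are hypothetical.

```python
def read_pieces(path):
    """Read the first tab-separated column (the subword) of each non-empty line."""
    pieces = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                pieces.add(line.split("\t")[0])
    return pieces


def missing_pieces(required_path, learned_path):
    """Return the required subwords that are absent from the learned vocabulary."""
    return read_pieces(required_path) - read_pieces(learned_path)
```

For example, missing_pieces("./multilingual.vocab", "<model_name>.vocab") would return an empty set if all required subwords are present.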

Paper

Please cite our paper \cite{bo2021vocapxlm} if you find the resources in this repository useful.

@inproceedings{bo2021vocapxlm,
author = {Bo Zheng and Li Dong and Shaohan Huang and Saksham Singhal and Wanxiang Che and Ting Liu and Xia Song and Furu Wei},
booktitle = {Proceedings of EMNLP 2021},
title = {{Allocating Large Vocabulary Capacity for Cross-lingual Language Model Pre-training}},
year = {2021}
}

Reference

  1. https://github.com/google/sentencepiece
  2. https://drive.google.com/file/d/1VttgE30xo-i1ig5xsMF_7R4AB2sA5J9F/view?usp=sharing