Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

Last update: Jan 09, 2023

Overview

**The codebase and data are still being uploaded.**

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.

What's New:

July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
July 2021: Support subword-nmt tokenization.
July 2021: Support sentencepiece tokenization.

What's On-going:

Add translation training/evaluation code.
Support classification tasks.
Support pip usage.

Features:

Efficient: CPU learning on one machine.
Simple: The core code is no more than 200 lines.
Easy-to-use: Supports the widely used tokenization toolkits subword-nmt and sentencepiece.
Flexible: Users can customize their own tokenization rules.

Requirements and Installation

The required environments:

python 3.0
tqdm
mosesdecoder
subword-nmt
sentencepiece

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm

Usage

The first step is to get vocabulary candidates and tokenized texts. The subword vocabulary can be generated with either subword-nmt or sentencepiece. Here are two examples:


#Assume source_data is the file storing data in the source language
#Assume target_data is the file storing data in the target language
BPEROOT=subword-nmt
size=30000 # the size of BPE
cat source_data > training_data
cat target_data >> training_data

#subword-nmt style:
mkdir bpeoutput
BPE_CODE=code # the path to save vocabulary
python3 $BPEROOT/learn_bpe.py -s $size < training_data > $BPE_CODE
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_data > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_data > bpeoutput/target.file

#sentencepiece style:
mkdir spmoutput
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
#After this step, you will see spm.vocab and spm.model
python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmoutput/source.file --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmoutput/target.file --output_format piece
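
Before moving on, it can help to check what the tokenized output looks like. The following is a minimal sketch, not part of VOLT, that counts token frequencies in whitespace-tokenized lines. The helper name `count_tokens` and the sample sentences are made up for illustration; `@@` is the default continuation marker written by subword-nmt.

```python
from collections import Counter


def count_tokens(lines):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Hypothetical sample in subword-nmt style ("@@" marks a non-final subword).
sample = ["the low@@ est price", "a low@@ er price"]
counts = count_tokens(sample)

# Show the most frequent tokens first.
for token, freq in counts.most_common(3):
    print(token, freq)
```

To run it on real output, pass an open file such as `open("bpeoutput/source.file", encoding="utf-8")` instead of `sample`.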

The second step is to run the VOLT script, which accepts the following parameters:
- --source_file: the file storing data in the source language.
- --target_file: the file storing data in the target language.
- --token_candidate_file: the file storing token candidates.
- --vocab_file: the file to store the vocabulary generated by VOLT.
- --max_number: the maximum size of the vocabulary generated by VOLT.
- --interval: the search granularity in VOLT.
- --loop_in_ot: the maximum number of iterations in the Sinkhorn solver.
- --tokenizer: the toolkit used to get the vocabulary. Only subword-nmt and sentencepiece are supported.
- --size_file: the file to store the vocabulary size generated by VOLT.
- --threshold: the threshold that decides which tokens from the optimal matrix are added to the final vocabulary. A lower threshold means that fewer token candidates are dropped.
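
As a rough illustration of how `--max_number`, `--interval`, and `--threshold` fit together, here is a minimal sketch. It is not VOLT's code: the function names, the inclusive upper bound, and the per-token mass dictionary are assumptions made for this example, and ot_run.py may treat these details differently.

```python
def candidate_sizes(max_number, interval):
    """Vocabulary sizes to evaluate: interval, 2 * interval, ..., max_number.

    Assumption: the upper bound is inclusive.
    """
    return list(range(interval, max_number + 1, interval))


def keep_tokens(token_mass, threshold):
    """Keep tokens whose mass is at least `threshold`.

    Assumption: `token_mass` maps each candidate token to a single
    non-negative number derived from the optimal transport matrix.
    """
    return [token for token, mass in token_mass.items() if mass >= threshold]


sizes = candidate_sizes(max_number=10000, interval=1000)

# Hypothetical masses for three token candidates.
masses = {"the": 0.1, "low@@": 0.002, "zzq@@": 1e-7}
strict = keep_tokens(masses, threshold=1e-3)
loose = keep_tokens(masses, threshold=1e-8)
```

In this sketch, a maximum of 10000 with an interval of 1000 yields ten candidate sizes, and lowering the threshold keeps more of the candidates.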
```
#subword-nmt style
python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
          --token_candidate_file $BPE_CODE \
          --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size 
#sentencepiece style (token candidates come from spm.vocab, produced in the first step)
python3 ../ot_run.py --source_file spmoutput/source.file --target_file spmoutput/target.file \
          --token_candidate_file spm.vocab \
          --vocab_file spmoutput/vocab --max_number 10000 --interval 1000 --loop_in_ot 500 --tokenizer sentencepiece --size_file spmoutput/size
```
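
The `--loop_in_ot` flag caps the number of Sinkhorn iterations. For readers unfamiliar with the method, the following is a generic, pure-Python sketch of Sinkhorn scaling for entropy-regularized optimal transport. It is a textbook version rather than VOLT's solver; the names `sinkhorn`, `reg`, and `tol` and the toy cost matrix are assumptions for illustration.

```python
import math


def sinkhorn(cost, row_mass, col_mass, reg=1.0, max_loops=500, tol=1e-9):
    """Return a transport plan whose row/column sums approach the given masses.

    cost      : n x m list of lists with transport costs
    row_mass  : n positive numbers (source distribution)
    col_mass  : m positive numbers (target distribution, same total)
    max_loops : hard cap on the number of scaling iterations
    """
    n, m = len(row_mass), len(col_mass)
    # Gibbs kernel K = exp(-cost / reg); every entry is strictly positive.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(max_loops):
        # Rescale rows so that row sums match row_mass.
        new_u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
                 for i in range(n)]
        # Rescale columns so that column sums match col_mass.
        v = [col_mass[j] / sum(kernel[i][j] * new_u[i] for i in range(n))
             for j in range(m)]
        change = max(abs(a - b) for a, b in zip(u, new_u))
        u = new_u
        if change < tol:  # stop early once the scaling has settled
            break
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy problem: two sources, two targets.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, row_mass=[0.7, 0.3], col_mass=[0.4, 0.6], max_loops=500)
```

A larger `max_loops` gives the scaling more chances to converge at the cost of run time, which is presumably the trade-off behind `--loop_in_ot`.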

The third step is to use the generated vocabulary to tokenize your texts:

  #for subword-nmt toolkit
  python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_data > bpeoutput/source.file
  python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_data > bpeoutput/target.file

  #for sentencepiece toolkit, here we only keep the optimal size
  best_size=$(cat spmoutput/size)
  python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe

  #After this step, you will see spm.vocab and spm.model
  python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmoutput/source.file --output_format piece
  python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmoutput/target.file --output_format piece
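
If the third step is driven from Python rather than the shell, the size file can be read with a small helper. This is a minimal sketch under the assumption that the size file holds a single integer, as the `best_size=$(cat spmoutput/size)` line suggests; `read_best_size` is a hypothetical name, and the temporary file only stands in for the real size file.

```python
import tempfile
from pathlib import Path


def read_best_size(size_file):
    """Read the vocabulary size from a file that contains one integer."""
    size = int(Path(size_file).read_text(encoding="utf-8").strip())
    if size <= 0:
        raise ValueError(f"vocabulary size must be positive, got {size}")
    return size


# Demo with a throwaway file standing in for spmoutput/size.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "size"
    demo_file.write_text("7000\n", encoding="utf-8")
    best_size = read_best_size(demo_file)
```

Failing early on an empty or malformed size file avoids retraining sentencepiece with a meaningless `--vocab_size`.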

Examples

Several examples are provided under "examples/".

Datasets

The WMT-14 En-De translation data can be downloaded via the running scripts.

For TED, you can download the data at TED.

Citation

Please cite as:

@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author= {Jingjing Xu and
               Hao Zhou and
               Chun Gan and
               Zaixiang Zheng and
               Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
