Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

Last update: Jan 09, 2023

Overview

**The codebase and data are still being uploaded.**

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.

What's New:

July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
July 2021: Support subword-nmt tokenization.
July 2021: Support sentencepiece tokenization.

What's On-going:

Add translation training/evaluation code.
Support classification tasks.
Support pip usage.

Features:

Efficient: CPU learning on one machine.
Simple: The core code is no more than 200 lines.
Easy-to-use: Supports the widely used tokenization toolkits subword-nmt and sentencepiece.
Flexible: Users can customize their own tokenization rules.

Requirements and Installation

The required environments:

python 3.0
tqdm
mosesdecoder
subword-nmt

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm
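
After the commands above, a quick sanity check can confirm that the pip packages are importable and that the two toolkits were cloned into the VOLT directory. The sketch below is not part of VOLT; the function name `check_environment` and its parameters are illustrative assumptions.

```python
import importlib.util
from pathlib import Path


def check_environment(repo_root=".",
                      modules=("sentencepiece", "tqdm"),
                      directories=("mosesdecoder", "subword-nmt")):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper, not shipped with VOLT.
    """
    problems = []
    # Python packages installed with pip3 in the steps above.
    for module in modules:
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing Python package: {module}")
    # Toolkits cloned next to ot_run.py in the steps above.
    for directory in directories:
        if not (Path(repo_root) / directory).is_dir():
            problems.append(f"missing directory: {directory}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running it from inside the cloned `VOLT` directory prints one line per missing dependency and nothing when the setup is complete.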

Usage

The first step is to get vocabulary candidates and tokenized texts. The sub-word vocabulary can be generated with either subword-nmt or sentencepiece. Here are two examples:


#Assume source_data is the file storing data in the source language
#Assume target_data is the file storing data in the target language
BPEROOT=subword-nmt
size=30000 # number of BPE merge operations (subword-nmt) or vocabulary size (sentencepiece)
cat source_data > training_data
cat target_data >> training_data

#subword-nmt style:
mkdir bpeoutput
BPE_CODE=code # the path to save vocabulary
python3 $BPEROOT/learn_bpe.py -s $size  < training_data > $BPE_CODE
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_data > bpeoutput/source.file
python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_data > bpeoutput/target.file

#sentencepiece style:
mkdir spmoutput
python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
#After this step, you will see spm.vocab and spm.model
python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmoutput/source.file --output_format piece
python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmoutput/target.file --output_format piece
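
To make the candidate-generation step more concrete, here is a toy sketch of what BPE learning does conceptually: count adjacent symbol pairs and repeatedly merge the most frequent one. It is a simplified illustration, not the implementation in subword-nmt or sentencepiece; the function name `learn_bpe`, the `</w>` end-of-word marker handling, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(lines, num_merges):
    """Learn up to num_merges merge operations from an iterable of text lines."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = Counter()
    for line in lines:
        for word in line.split():
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an arbitrary choice made here to keep the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    for left, right in learn_bpe(["abab abab cd"], num_merges=2):
        print(left, right)
```

The ordered list of merges plays the role of the token candidates: earlier merges correspond to more frequent sub-word units.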

The second step is to run the VOLT script, which accepts the following parameters:
- --source_file: the file storing data in the source language.
- --target_file: the file storing data in the target language.
- --token_candidate_file: the file storing token candidates.
- --max_number: the maximum size of the vocabulary generated by VOLT.
- --interval: the search granularity in VOLT.
- --loop_in_ot: the maximum number of iterations in the Sinkhorn solver.
- --tokenizer: which toolkit you use to get vocabulary. Only subword-nmt and sentencepiece are supported.
- --size_file: the file to store the vocabulary size generated by VOLT.
- --vocab_file: the file to store the vocabulary generated by VOLT.
- --threshold: the threshold that decides which tokens from the optimal matrix are added to the final vocabulary. A lower threshold means that fewer token candidates are dropped.
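
As an illustration of how `--max_number` and `--interval` interact, the sketch below enumerates the vocabulary sizes a simple grid search would visit. The helper `candidate_sizes` is hypothetical, and the choice of endpoints (start at `interval`, include `max_number`) is an assumption; VOLT's own search may differ in these details.

```python
def candidate_sizes(max_number, interval):
    """List the vocabulary sizes visited by a simple grid search.

    Assumption: the grid starts at `interval` and includes `max_number`
    when it is a multiple of `interval`.
    """
    if max_number <= 0 or interval <= 0:
        raise ValueError("max_number and interval must be positive")
    return list(range(interval, max_number + 1, interval))


sizes = candidate_sizes(max_number=10000, interval=1000)
print(len(sizes), sizes[0], sizes[-1])  # prints: 10 1000 10000
```

A smaller interval gives a finer search over sizes at the cost of more optimal-transport runs.
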
```
#subword-nmt style
python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
          --token_candidate_file $BPE_CODE \
          --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size 
#sentencepiece style
python3 ../ot_run.py --source_file spmoutput/source.file --target_file spmoutput/target.file \
          --token_candidate_file $BPE_CODE \
          --vocab_file spmoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer sentencepiece --size_file spmoutput/size 
```
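
The `--loop_in_ot` option bounds the number of Sinkhorn iterations. For readers unfamiliar with the method, the following is a minimal pure-Python sketch of entropy-regularized optimal transport solved with Sinkhorn scaling. It is a generic textbook version, not the code in `ot_run.py`; the function name, `epsilon`, and `tolerance` are assumptions introduced for illustration.

```python
import math


def sinkhorn(cost, row_marginals, col_marginals,
             epsilon=0.1, max_iterations=500, tolerance=1e-9):
    """Return a transport plan whose row and column sums approximate the marginals.

    max_iterations plays the role that --loop_in_ot plays on the command line.
    """
    n, m = len(row_marginals), len(col_marginals)
    # Gibbs kernel K[i][j] = exp(-cost[i][j] / epsilon).
    # Very large cost/epsilon ratios can underflow to zero; this toy version
    # does not guard against that.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(max_iterations):
        # Rescale rows so that the row sums match row_marginals.
        new_u = [row_marginals[i] / sum(kernel[i][j] * v[j] for j in range(m))
                 for i in range(n)]
        # Rescale columns so that the column sums match col_marginals.
        new_v = [col_marginals[j] / sum(kernel[i][j] * new_u[i] for i in range(n))
                 for j in range(m)]
        change = max(abs(a - b) for a, b in zip(u + v, new_u + new_v))
        u, v = new_u, new_v
        if change < tolerance:
            break  # scaling vectors have stopped moving
    # Plan P = diag(u) K diag(v).
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.7, 0.3], [0.4, 0.6], epsilon=0.5)
    for row in plan:
        print([round(p, 4) for p in row])
```

Raising the iteration cap only matters when the scaling vectors have not yet converged; once they have, the loop exits early.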

The third step is to use the generated vocabulary to tokenize your texts:

  #for subword-nmt toolkit
  python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_data > bpeoutput/source.file
  python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_data > bpeoutput/target.file

  #for sentencepiece toolkit, here we only keep the optimal size
  best_size=$(cat spmoutput/size)
  python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe

  #After this step, you will see spm.vocab and spm.model
  python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmoutput/source.file --output_format piece
  python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmoutput/target.file --output_format piece
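
To show what tokenizing with a learned vocabulary means in the subword-nmt case, here is a toy sketch that segments a single word with an ordered list of merges and marks non-final pieces with `@@`. It is a simplified illustration rather than the logic in `apply_bpe.py`; the function name `apply_bpe` and the example merge list are assumptions.

```python
def apply_bpe(word, merges, separator="@@"):
    """Segment one word using an ordered merge list (earlier merges win)."""
    if not word:
        return []
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word) + ["</w>"]
    while len(symbols) > 1:
        # Collect the adjacent pairs that appear in the merge list.
        candidates = [(rank[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in rank]
        if not candidates:
            break
        best = merges[min(candidates)[0]]
        # Merge every occurrence of the highest-priority pair.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Remove the end-of-word marker, whether standalone or fused into the last piece.
    if symbols[-1] == "</w>":
        symbols = symbols[:-1]
    else:
        symbols[-1] = symbols[-1][: -len("</w>")]
    # Every piece except the last carries the continuation separator.
    return [s + separator for s in symbols[:-1]] + [symbols[-1]]


if __name__ == "__main__":
    toy_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]
    print(" ".join(apply_bpe("lower", toy_merges)))  # prints: low@@ er
```

A smaller merge list leaves words split into more pieces, which is the granularity trade-off that the vocabulary size controls.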

Examples

Several examples are provided under "examples/".

Datasets

The WMT-14 En-De translation data can be downloaded via the running scripts.

For TED, you can download the data from TED.

Citation

Please cite as:

@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author= {Jingjing Xu and
               Hao Zhou and
               Chun Gan and
               Zaixiang Zheng and
               Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
