Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

Last update: Jan 01, 2023

Related tags

Deep Learning mRASP2

Overview

Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

The code for training mCOLT/mRASP2, a multilingual NMT training framework, implemented based on fairseq.

mRASP2: paper

mRASP: paper, code

News

We have released two versions, this version is the original one. In this implementation:

You should first merge all data, by pre-pending language token before each sentence to indicate the language.
AA/RAS muse be done off-line (before binarize), check this toolkit.

New implementation: https://github.com/PANXiao1994/mRASP2/tree/new_impl

Acknowledgement: This work is supported by Bytedance. We thank Chengqi for uploading all files and checkpoints.

Introduction

mRASP2/mCOLT, representing multilingual Contrastive Learning for Transformer, is a multilingual neural machine translation model that supports complete many-to-many multilingual machine translation. It employs both parallel corpora and multilingual corpora in a unified training framework. For detailed information please refer to the paper.

Pre-requisite

pip install -r requirements.txt

Training Data and Checkpoints

We release our preprocessed training data and checkpoints in the following.

Dataset

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original 32 language pairs corpus contains about 197M pairs of sentences. We get about 262M pairs of sentences after applying RAS, since we keep both the original sentences and the substituted sentences. We release both the original dataset and dataset after applying RAS.

Dataset	#Pair
32-lang-pairs-TRAIN	197603294
32-lang-pairs-RAS-TRAIN	262662792
mono-split-a	-
mono-split-b	-
mono-split-c	-
mono-split-d	-
mono-split-e	-
mono-split-de-fr-en	-
mono-split-nl-pl-pt	-
32-lang-pairs-DEV-en-centric	-
32-lang-pairs-DEV-many-to-many	-
Vocab	-
BPE Code	-

Checkpoints & Results

Please note that the provided checkpoint is sightly different from that in the paper. In the following sections, we report the results of the provided checkpoints.

English-centric Directions

We report tokenized BLEU in the following table. (check eval.sh for details)

	6e6d-no-mono	12e12d-no-mono	12e12d
en2cs/wmt16	21.0	22.3	23.8
cs2en/wmt16	29.6	32.4	33.2
en2fr/wmt14	42.0	43.3	43.4
fr2en/wmt14	37.8	39.3	39.5
en2de/wmt14	27.4	29.2	29.5
de2en/wmt14	32.2	34.9	35.2
en2zh/wmt17	33.0	34.9	34.1
zh2en/wmt17	22.4	24.0	24.4
en2ro/wmt16	26.6	28.1	28.7
ro2en/wmt16	36.8	39.0	39.1
en2tr/wmt16	18.6	20.3	21.2
tr2en/wmt16	22.2	25.5	26.1
en2ru/wmt19	17.4	18.5	19.2
ru2en/wmt19	22.0	23.2	23.6
en2fi/wmt17	20.2	22.1	22.9
fi2en/wmt17	26.1	29.5	29.7
en2es/wmt13	32.8	34.1	34.6
es2en/wmt13	32.8	34.6	34.7
en2it/wmt09	28.9	30.0	30.8
it2en/wmt09	31.4	32.7	32.8

Unsupervised Directions

We report tokenized BLEU in the following table. (check eval.sh for details)

	12e12d
en2pl/wmt20	6.2
pl2en/wmt20	13.5
en2nl/iwslt14	8.8
nl2en/iwslt14	27.1
en2pt/opus100	18.9
pt2en/opus100	29.2

Zero-shot Directions

row: source language
column: target language We report sacreBLEU in the following table.

12e12d	ar	zh	nl	fr	de	ru
ar	-	32.5	3.2	22.8	11.2	16.7
zh	6.5	-	1.9	32.9	7.6	23.7
nl	1.7	8.2	-	7.5	10.2	2.9
fr	6.2	42.3	7.5	-	18.9	24.4
de	4.9	21.6	9.2	24.7	-	14.4
ru	7.1	40.6	4.5	29.9	13.5	-

Training

export NUM_GPU=4 && bash train_w_mono.sh ${model_config}

We give example of ${model_config} in ${PROJECT_REPO}/examples/configs/parallel_mono_12e12d_contrastive.yml

Inference

You must pre-pend the corresponding language token to the source side before binarize the test data.

${final_res_file} python3 ${repo_dir}/scripts/utils.py ${res_file} ${ref_file} || exit 1; ">

fairseq-generate ${test_path} \
    --user-dir ${repo_dir}/mcolt \
    -s ${src} \
    -t ${tgt} \
    --skip-invalid-size-inputs-valid-test \
    --path ${ckpts} \
    --max-tokens ${batch_size} \
    --task translation_w_langtok \
    ${options} \
    --lang-prefix-tok "LANG_TOK_"`echo "${tgt} " | tr '[a-z]' '[A-Z]'` \
    --max-source-positions ${max_source_positions} \
    --max-target-positions ${max_target_positions} \
    --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ${final_res_file}
python3 ${repo_dir}/scripts/utils.py ${res_file} ${ref_file} || exit 1;

Synonym dictionaries

We use the bilingual synonym dictionaries provised by MUSE.

We generate multilingual synonym dictionaries using this script, and apply RAS using this script.

Description	File	Size
dep=1	synonym_dict_raw_dep1	138.0 M
dep=2	synonym_dict_raw_dep2	1.6 G
dep=3	synonym_dict_raw_dep3	2.2 G

Contact

Please contact me via e-mail [email protected] or via wechat/zhihu

Citation

Please cite as:

@inproceedings{mrasp2,
  title = {Contrastive Learning for Many-to-many Multilingual Neural Machine Translation},
  author= {Xiao Pan and
           Mingxuan Wang and
           Liwei Wu and
           Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}

Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

Related tags

Overview

Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

News

Introduction

Pre-requisite

Training Data and Checkpoints

Dataset

Checkpoints & Results

English-centric Directions

Unsupervised Directions

Zero-shot Directions

Training

Inference

Synonym dictionaries

Contact

Citation

Owner

Implementation of fast algorithms for Maximum Spanning Tree (MST) parsing that includes fast ArcMax+Reweighting+Tarjan algorithm for single-root dependency parsing.

Code for intrusion detection system (IDS) development using CNN models and transfer learning

《Deep Single Portrait Image Relighting》(ICCV 2019)

NasirKhusraw - The TSP solved using genetic algorithm and show TSP path overlaid on a map of the Iran provinces & their capitals.

Official implementation for "Low-light Image Enhancement via Breaking Down the Darkness"

Computer Vision is an elective course of MSAI, SCSE, NTU, Singapore

Doubly Robust Off-Policy Evaluation for Ranking Policies under the Cascade Behavior Model

Companion repo of the UCC 2021 paper "Predictive Auto-scaling with OpenStack Monasca"

Using pretrained GROVER to extract the atomic fingerprints from molecule

GrailQA: Strongly Generalizable Question Answering

Self-Supervised Generative Style Transfer for One-Shot Medical Image Segmentation

TilinGNN: Learning to Tile with Self-Supervised Graph Neural Network (SIGGRAPH 2020)

Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

a delightful machine learning tool that allows you to train, test and use models without writing code

🔎 Monitor deep learning model training and hardware usage from your mobile phone 📱

the official code for ICRA 2021 Paper: "Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation"

This repo provides a demo for the CVPR 2021 paper "A Fourier-based Framework for Domain Generalization" on the PACS dataset.

TANL: Structured Prediction as Translation between Augmented Natural Languages

[CVPR 2021] Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision

Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition