A benchmark for the task of translation suggestion

Last update: Dec 24, 2022

Overview

WeTS: A Benchmark for Translation Suggestion

Translation Suggestion (TS), which provides alternatives for specific words or phrases given the entire document translated by machine translation (MT), has been proven to play a significant role in post-editing (PE). WeTS is a benchmark dataset for TS, annotated by expert translators. WeTS contains corpora (train/dev/test) for four translation directions: English2German, German2English, Chinese2English, and English2Chinese.

Data
Models
Get Started
Citation
Licence

Data

WeTS is a benchmark dataset for TS, where all the examples are annotated by expert translators. As far as we know, this is the first golden corpus for TS. The statistics about WeTS are listed in the following table:

Translation Direction	Train	Valid	Test
English2German	14,957	1,000	1,000
German2English	11,777	1,000	1,000
English2Chinese	15,769	1,000	1,000
Chinese2English	21,213	1,000	1,000

For the corpus in each direction, the data is organized as:
direction.split.src: the source-side sentences
direction.split.mask: the masked translation sentences; the placeholder is "<MASK>"
direction.split.tgt: the target suggestions; the test set for English2Chinese has three references for each example

direction: En2De, De2En, Zh2En, En2Zh
split: train, dev, test
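
The file layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the released code. It assumes one example per line and that the three files are line-aligned, which the description suggests but does not state. The names TSExample, read_examples and fill_mask are hypothetical helpers introduced here for illustration. The on-disk format of the three references in the English2Chinese test set is not described here, so the sketch does not try to split them.

```python
from pathlib import Path
from typing import List, NamedTuple

MASK = "<MASK>"  # placeholder used in the .mask files


class TSExample(NamedTuple):
    source: str      # line from direction.split.src
    masked: str      # line from direction.split.mask
    suggestion: str  # line from direction.split.tgt


def read_examples(data_dir: str, direction: str, split: str) -> List[TSExample]:
    """Load one direction/split, e.g. direction="En2De", split="dev".

    Assumption: the three files hold one example per line and are line-aligned.
    """
    columns = []
    for ext in ("src", "mask", "tgt"):
        path = Path(data_dir) / f"{direction}.{split}.{ext}"
        with path.open(encoding="utf-8") as handle:
            columns.append([line.rstrip("\n") for line in handle])
    if len({len(column) for column in columns}) != 1:
        raise ValueError("src/mask/tgt files do not have the same number of lines")
    return [TSExample(*row) for row in zip(*columns)]


def fill_mask(masked: str, suggestion: str) -> str:
    """Substitute a suggestion for the first placeholder in a masked translation."""
    return masked.replace(MASK, suggestion, 1)
```

For example, read_examples("data", "En2De", "dev") would return a list of examples from a hypothetical data directory, and fill_mask(example.masked, example.suggestion) would produce the translation with the masked span filled in.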

Models

We release the pre-trained NMT models that were used to generate the MT sentences. The released NMT models can also be used to generate a synthetic corpus for TS, which can improve the final performance dramatically. A detailed description of how the synthetic corpus is generated can be found in our paper.

The released models can be downloaded at:

Download the models

and the password is "2iyk"

To run inference with a released model:

sh inference_<direction>.sh

where <direction> is one of: en2de, de2en, en2zh, zh2en
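
When inference is driven from Python, for instance to loop over all four directions, the script name can be built from the direction string. This is a small sketch under stated assumptions. The helper names inference_command and run_inference are hypothetical. The scripts are assumed to sit in the current working directory and to take no arguments, as in the command above. Lowercasing the input is a convenience added here, not something the scripts require.

```python
import subprocess
from typing import List

# Directions for which an inference script is listed above.
DIRECTIONS = ("en2de", "de2en", "en2zh", "zh2en")


def inference_command(direction: str) -> List[str]:
    """Build the argv for the per-direction inference script."""
    direction = direction.lower()
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction {direction!r}; expected one of {DIRECTIONS}")
    return ["sh", f"inference_{direction}.sh"]


def run_inference(direction: str) -> None:
    """Run the script; check=True raises CalledProcessError on a non-zero exit."""
    subprocess.run(inference_command(direction), check=True)
```

Calling run_inference("en2de") is then equivalent to running sh inference_en2de.sh by hand.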

Get Started

data preprocessing

sh process.sh

pre-training

Code for the first-phase pre-training is not included in this repo, as we directly used the XLM code (https://github.com/facebookresearch/XLM) with little modification. We did not achieve much gain from the first-phase pre-training.

The second-phase pre-training:

sh preptraining.sh

fine-tuning

sh finetuning.sh

The code in this repo is mainly forked from fairseq (https://github.com/pytorch/fairseq.git).

Citation

Please cite the following paper if you find the resources in this repository useful.

@article{yang2021wets,
  title={WeTS: A Benchmark for Translation Suggestion},
  author={Yang, Zhen and Zhang, Yingxue and Li, Ernan and Meng, Fandong and Zhou, Jie},
  journal={arXiv preprint arXiv:2110.05151},
  year={2021}
}

LICENCE

See LICENCE

Owner

zhyang
