Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Overview

Mask-Align: Self-Supervised Neural Word Alignment

This is the implementation of our work Mask-Align: Self-Supervised Neural Word Alignment.

@inproceedings{chen2021maskalign,
   title={Mask-Align: Self-Supervised Neural Word Alignment},
   author={Chi Chen and Maosong Sun and Yang Liu},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2021}
}

The implementation is built on top of THUMT.

Contents

Introduction

Mask-Align is a self-supervised neural word aligner. It parallelly masks out each target token and predicts it conditioned on both source and the remaining target tokens. The source token that contributes most to recovering a masked target token will be aligned to that target token.

Prerequisites

  • PyTorch
  • NLTK
  • remi *
  • pyecharts *
  • pandas *
  • matplotlib *
  • seaborn *

*: optional, only used for Visualization.

Usage

Data Preparation

To get the data used in our paper, you can follow the instructions in https://github.com/lilt/alignment-scripts.

To train an aligner with your own data, you should pre-process it yourself. Usually this includes tokenization, BPE, etc. You can find a simple guide here.

Now we have the pre-processed parallel training data (train.src, train.tgt), validation data (optional) (valid.src, valid.tgt) and test data (test.src, test.tgt). An example 3-sentence German–English parallel training corpus is:

# train.src
wiederaufnahme der sitzungsperiode
frau präsidentin , zur geschäfts @@ordnung .
ich bitte sie , sich zu einer schweigeminute zu erheben .

# train.tgt
resumption of the session
madam president , on a point of order .
please rise , then , for this minute ' s silence .

The next step is to shuffle the training set, which proves to be helpful for improving the results.

python thualign/scripts/shuffle_corpus.py --corpus train.src train.tgt

The resulting files train.src.shuf and train.tgt.shuf rearrange the sentence pairs randomly.

Then we need to generate vocabulary from the training set.

python thualign/scripts/build_vocab.py train.src.shuf vocab.train.src
python thualign/scripts/build_vocab.py train.tgt.shuf vocab.train.tgt

The resulting files vocab.train.src.txt and vocab.train.tgt.txt are final source and target vocabularies used for model training.

Training

All experiments are configured via config files in thualign/configs, see Configs for more details.. We provide an example config file thualign/configs/user/example.config. You can easily use it by making three changes:

  1. change device_list, update_cycle and batch_size to match your machine configuration;

  2. change exp_dir and output to your own experiment directory

  3. change train/valid/test_input and vocab to your data paths;

When properly configured, you can use the following command to train an alignment model described in the config file

bash thualign/bin/train.sh -s thualign/configs/user/example.config

or more simply

bash thualign/bin/train.sh -s example

The configuration file is an INI file and is parsed through configparser. By adding a new section, you can easily customize some configs while keep other configs unchanged.

[DEFAULT]
...

[small_budget]
batch_size = 4500
update_cycle = 8
device_list = [0]
half = False

Use -e option to run this small_budget section

bash thualign/bin/train.sh -s example -e small_budget

You can also monitor the training process through tensorboard

tensorboard --logdir=[output]

Test

After training, the following command can be used to generate attention weights (-g), generate data for attention visualization (-v), and test its AER (-t) if test_ref is provided.

bash thualign/bin/test.sh -s [CONFIG] -e [EXP] -gvt

For example, to test the model trained with the configs in example.config

bash thualign/bin/test.sh -s example -gvt

You might get the following output

alignment-soft.txt: 14.4% (87.7%/83.5%/9467)

The alignment results (alignment.txt) along with other test results are stored in [output]/test by default.

Configs

Most of the configuration of Mask-Align is done through configuration files in thualign/configs. The model reads the basic configs first, followed by the user-defined configs.

Basic Config

Predefined configs for experiments to use.

  • base.config: basic configs for training, validation and test

  • model.config: define different models with their hyperparameters

User Config

Customized configs that must describe the following configuration and maybe other experiment-specific parameters:

  • train/valid/test_input: paths of input parallel corpuses
  • vocab: paths of vocabulary files generated from thualign/scripts/build_vocab.py
  • output: path to save the model outputs
  • model: which model to use
  • batch_size: the batch size (number of tokens) used in the training stage.
  • update_cycle: the number of iterations for updating model parameters. The default value is 1. If you have only 1 GPU and want to obtain the same translation performance with using 4 GPUs, simply set this parameter to 4. Note that the training time will also be prolonged.
  • device_list: the list of GPUs to be used in training. Use the nvidia-smi command to find unused GPUs. If the unused GPUs are gpu0 and gpu1, set this parameter as device_list=[0,1].
  • half: set this to True if you wish to use half-precision training. This will speeds up the training procedure. Make sure that you have the GPUs with half-precision support.

Here is a minimal experiment config:

### thualign/configs/user/example.config
[DEFAULT]

train_input = ['train.src', 'train.tgt']
valid_input = ['valid.src', 'valid.tgt']
vocab = ['vocab.src.txt', 'vocab.tgt.txt']
test_input = ['test.src', 'test.tgt']
test_ref = test.talp

exp_dir = exp
label = agree_deen
output = ${exp_dir}/${label}

model = mask_align

batch_size = 9000
update_cycle = 1
device_list = [0,1,2,3]
half = True

Visualization

To better understand and analyze the model, Mask-Align supports the following two types of visulizations.

Training Visualization

Add eval_plot = True in your config file to turn on visualization during training. This will plot 5 attention maps from evaluation in the tensorboard.

These packages are required for training visualization:

  • pandas
  • matplotlib
  • seaborn

Attention Visualization

Use -v in the test command to generate alignment_vizdata.pt first. It is stored in [output]/test by default. To visualize it, using this script

python thualign/scripts/visualize.py [output]/test/alignment_vizdata.pt [--port PORT]

This will start a local service that plots the attention weights for all the test sentence pairs. You can access it through a web browser.

These packages are required for training visualization:

  • remi
  • pyecharts

Contact

If you have questions, suggestions and bug reports, please email [email protected].

Owner
THUNLP-MT
Machine Translation Group, Natural Language Processing Lab at Tsinghua University (THUNLP). Please refer to https://github.com/thunlp for more NLP resources.
THUNLP-MT
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
Code for the paper "BERT Loses Patience: Fast and Robust Inference with Early Exit".

Patience-based Early Exit Code for the paper "BERT Loses Patience: Fast and Robust Inference with Early Exit". NEWS: We now have a better and tidier i

Kevin Canwen Xu 54 Jan 04, 2023
Transformers Wav2Vec2 + Parlance's CTCDecodeTransformers Wav2Vec2 + Parlance's CTCDecode

🤗 Transformers Wav2Vec2 + Parlance's CTCDecode Introduction This repo shows how 🤗 Transformers can be used in combination with Parlance's ctcdecode

Patrick von Platen 9 Jul 21, 2022
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

Ryuichi Yamamoto 1.8k Dec 30, 2022
Signature remover is a NLP based solution which removes email signatures from the rest of the text.

Signature Remover Signature remover is a NLP based solution which removes email signatures from the rest of the text. It helps to enchance data conten

Forges Alterway 8 Jan 06, 2023
The implementation of Parameter Differentiation based Multilingual Neural Machine Translation

The implementation of Parameter Differentiation based Multilingual Neural Machine Translation .

Qian Wang 21 Dec 17, 2022
A collection of GNN-based fake news detection models.

This repo includes the Pytorch-Geometric implementation of a series of Graph Neural Network (GNN) based fake news detection models. All GNN models are implemented and evaluated under the User Prefere

SafeGraph 251 Jan 01, 2023
DeepAmandine is an artificial intelligence that allows you to talk to it for hours, you won't know the difference.

DeepAmandine This is an artificial intelligence based on GPT-3 that you can chat with, it is very nice and makes a lot of jokes. We wish you a good ex

BuyWithCrypto 3 Apr 19, 2022
Vad-sli-asr - A Python scripts for a speech processing pipeline with Voice Activity Detection (VAD)

VAD-SLI-ASR Python scripts for a speech processing pipeline with Voice Activity

Dynamics of Language 14 Dec 09, 2022
Under the hood working of transformers, fine-tuning GPT-3 models, DeBERTa, vision models, and the start of Metaverse, using a variety of NLP platforms: Hugging Face, OpenAI API, Trax, and AllenNLP

Transformers-for-NLP-2nd-Edition @copyright 2022, Packt Publishing, Denis Rothman Contact me for any question you have on LinkedIn Get the book on Ama

Denis Rothman 150 Dec 23, 2022
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 3k Jan 06, 2023
Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

Auto-Research A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting arti

Sidharth Pal 20 Dec 14, 2022
The ibet-Prime security token management system for ibet network.

ibet-Prime The ibet-Prime security token management system for ibet network. Features ibet-Prime is an API service that enables the issuance and manag

BOOSTRY 8 Dec 22, 2022
GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning

GrammarTagger — A Neural Multilingual Grammar Profiler for Language Learning GrammarTagger is an open-source toolkit for grammatical profiling for lan

Octanove Labs 27 Jan 05, 2023
nlpcommon is a python Open Source Toolkit for text classification.

nlpcommon nlpcommon, Python Text Tool. Guide Feature Install Usage Dataset Contact Cite Reference Feature nlpcommon is a python Open Source

xuming 3 May 29, 2022
Production First and Production Ready End-to-End Keyword Spotting Toolkit

Production First and Production Ready End-to-End Keyword Spotting Toolkit

223 Jan 02, 2023
Simple GUI where you can enter an article and get a crisp summarized version.

Text-Summarization-using-TextRank-BART Simple GUI where you can enter an article and get a crisp summarized version. How to run: Clone the repo Instal

Rohit P 4 Sep 28, 2022
Source code for AAAI20 "Generating Persona Consistent Dialogues by Exploiting Natural Language Inference".

Generating Persona Consistent Dialogues by Exploiting Natural Language Inference Source code for RCDG model in AAAI20 Generating Persona Consistent Di

16 Oct 08, 2022
A paper list of pre-trained language models (PLMs).

Large-scale pre-trained language models (PLMs) such as BERT and GPT have achieved great success and become a milestone in NLP.

RUCAIBox 124 Jan 02, 2023
XLNet: Generalized Autoregressive Pretraining for Language Understanding

Introduction XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective.

Zihang Dai 6k Jan 07, 2023