Training and evaluation code for the BertGen paper (ACL-IJCNLP 2021)

Last update: Oct 26, 2022

Overview

BERTGEN

This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codebase is based on the official VL-BERT repository (https://github.com/jackroos/VL-BERT), presented in the paper "VL-BERT: Pre-training of Generic Visual-Linguistic Representations".

Introduction

BERTGEN extends the VL-BERT model by making it multilingual, inheriting multilingual pretraining from multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md). The BERTGEN model produces multilingual, multimodal embeddings used for visual-linguistic generation tasks.

BERTGEN takes advantage of the large-scale training of VL-BERT and M-BERT, and is further trained in a generative setting as described in the paper.

Figure 1: Overview of the BERTGEN architecture

Special thanks to VL-BERT, PyTorch and its 3rd-party libraries and BERT. This codebase also uses the following features inherited from VL-BERT:

Distributed Training
Various Optimizers and Learning Rate Schedulers
Gradient Accumulation
Monitoring the Training Using TensorboardX

Prepare

Environment

Ubuntu 16.04, CUDA 9.0, GCC 4.9.4

Python 3.6.x

# We recommend using Anaconda/Miniconda to create a conda environment
conda create -n bertgen python=3.6 pip
conda activate bertgen

PyTorch 1.0.0 or 1.1.0

conda install pytorch=1.1.0 cudatoolkit=9.0 -c pytorch

Apex (optional, for speed-up and fp16 training)

git clone https://github.com/jackroos/apex
cd ./apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Other requirements:

pip install Cython
pip install -r requirements.txt

Compile
```
./scripts/init.sh
```

Data

The datasets used for training and evaluating BERTGEN are hosted on Zenodo. After checking out the code repository, extract the .tar.gz file downloaded from Zenodo to <github checkout folder>/data. A README file is included in the download with more information on the structure of the datasets.
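
The extraction step can also be scripted. The following is a minimal sketch, assuming the download is a regular gzip-compressed tar archive. The function name `extract_dataset` is hypothetical and not part of this repository. Whether the archive already contains a top-level folder is not stated here, so check the resulting layout against the bundled README.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive_path, checkout_dir):
    """Extract a .tar.gz archive into <checkout_dir>/data and return that path."""
    data_dir = Path(checkout_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(data_dir)
    return data_dir
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.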

Pre-trained Models

See PREPARE_PRETRAINED_MODELS.md.

Training

Distributed Training on Single-Machine

./scripts/dist_run_single.sh <num_gpus> <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

<num_gpus>: number of gpus to use.
<task>: LanguageGeneration.
<path_to_cfg>: config yaml file under ./cfgs/<task>.
<dir_to_store_checkpoint>: root directory to store checkpoints.

The following is a more concrete example:

./scripts/dist_run_single.sh 4 LanguageGeneration/train_end2end.py ./cfgs/multitask_training/base_prec_multitask_train_global.yaml ./checkpoints

Distributed Training on Multi-Machine

For example, on 2 machines (A and B), each with 4 GPUs,

run the following command on machine A:

./scripts/dist_run_multi.sh 2 0 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

run the following command on machine B:

./scripts/dist_run_multi.sh 2 1 <ip_addr_of_A> 4 <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>
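
In the two commands above, only the second argument changes between machines. The sketch below generates one command line per machine under the assumption, inferred from this example rather than from the script itself, that the arguments are the number of machines, the rank of the current machine, the IP address of machine A, and the number of GPUs per machine. The helper `multi_node_commands` is hypothetical.

```python
def multi_node_commands(master_ip, num_machines, gpus_per_machine,
                        script, cfg, ckpt_dir):
    """Return one dist_run_multi.sh command line per machine, ordered by rank."""
    commands = []
    for rank in range(num_machines):
        # Assumed argument order: <num_machines> <rank> <master_ip> <gpus_per_machine>.
        parts = [
            "./scripts/dist_run_multi.sh",
            str(num_machines),
            str(rank),
            master_ip,
            str(gpus_per_machine),
            script,
            cfg,
            ckpt_dir,
        ]
        commands.append(" ".join(parts))
    return commands
```

Printing the returned list gives the line to run on each machine, with rank 0 corresponding to machine A.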

Training:
- multitask training: "MODULE: BERTGENMultitaskTraining"

Non-Distributed Training

./scripts/nondist_run.sh <task>/train_end2end.py <path_to_cfg> <dir_to_store_checkpoint>

Note:

In the yaml files under ./cfgs, the batch size is set for GPUs with at least 32G of memory. You may need to adapt the batch size and the gradient accumulation steps to your hardware: if you decrease the batch size, increase the gradient accumulation steps accordingly to keep the 'actual' batch size for SGD unchanged. Note that for the multitask training of 13 tasks, the batch size is already set to the minimum of 1 sample from each dataset per task, so to fit on a GPU with less than 32G of memory you would have to reduce the number of datasets.
For efficiency, we recommend using distributed training even on a single machine.
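
The trade-off between batch size and gradient accumulation can be checked with a little arithmetic. This is a minimal sketch, assuming the 'actual' batch size per optimizer update is the product of the per-GPU batch size, the number of GPUs, and the accumulation steps. The function names are hypothetical, and the numbers in the usage note are illustrative, not values from the configs.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accumulation_steps):
    """Number of samples contributing to one optimizer update."""
    return per_gpu_batch * num_gpus * accumulation_steps


def accumulation_steps_for(target_effective_batch, per_gpu_batch, num_gpus):
    """Accumulation steps needed to reach the target with a given per-GPU batch."""
    per_step = per_gpu_batch * num_gpus
    if target_effective_batch % per_step != 0:
        raise ValueError("target is not a multiple of per_gpu_batch * num_gpus")
    return target_effective_batch // per_step
```

For instance, going from a per-GPU batch of 8 on 4 GPUs with no accumulation (32 samples per update) to a per-GPU batch of 2 requires `accumulation_steps_for(32, 2, 4)`, which is 4 accumulation steps.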

Evaluation

Language Generation tasks (MT, MMT, IC)

Generate prediction results on the selected test dataset. Both the dataset and the task (MT, MMT, IC) are specified in the .yaml file:

python LanguageGeneration/test.py \
  --cfg <cfg_of_downstream_task> \
  --ckpt <checkpoint_of_pretrained_model> \
  --gpus <indexes_of_gpus_to_use> \
  --result-path <dir_to_save_result> --result-name <result_file_name>

Inference:
- Machine Translation: "MODULE: BERTGENGenerateMMT"
- Multimodal Machine Translation: "MODULE: BERTGENGenerateMT"
- Image Captioning: "MODULE: BERTGENGenerateImageOnly"
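
Before launching inference, it can help to confirm which module a config selects. The sketch below scans the config text for a `MODULE:` entry, assuming the key appears on its own line in the form shown in the list above. It deliberately avoids a YAML parser so that it has no extra dependencies, and the helper name `find_module` is hypothetical.

```python
import re


def find_module(cfg_text):
    """Return the value of the first `MODULE:` entry in a config text, or None."""
    match = re.search(
        r"^\s*MODULE:\s*[\"']?([\w.]+)[\"']?\s*(?:#.*)?$",
        cfg_text,
        flags=re.MULTILINE,
    )
    return match.group(1) if match else None
```

Read the yaml file into a string and pass it to `find_module`, then compare the result with the module expected for the task.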

Evaluation Metrics

After generating results, the generated text file can be compared with the ground truth in tokenised format. We used the nmtpytorch tool to compute these metrics. An example is shown below:

nmtpy-coco-metrics -l de "./checkpoints/generated/ENDEIMG.txt" -r "./data/ground_truths/ENDEIMG.txt.tok"
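
A mismatch in the number of lines between the generated file and the ground truth usually points to a failed or truncated generation run. The following is a minimal sanity check to run before computing metrics, assuming both files contain one sentence per line. The helper names are hypothetical.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(hypothesis_path, reference_path):
    """Return the shared line count, or raise ValueError if the files differ."""
    hyp = count_lines(hypothesis_path)
    ref = count_lines(reference_path)
    if hyp != ref:
        raise ValueError(f"{hyp} generated lines vs. {ref} reference lines")
    return hyp
```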

Acknowledgements

Many thanks to the following codebases that have been essential while building this codebase:
