Code for our paper "Transfer Learning for Sequence Generation: from Single-source to Multi-source" in ACL 2021.

Overview

TRICE: a task-agnostic transferring framework for multi-source sequence generation

This is the source code of our work Transfer Learning for Sequence Generation: from Single-source to Multi-source (ACL 2021).

We propose TRICE, a task-agnostic Transferring fRamework for multI-sourCe sEquence generation, for transferring pretrained models to multi-source sequence generation tasks (e.g., automatic post-editing, multi-source translation, and multi-document summarization). TRICE achieves new state-of-the-art results on the WMT17 APE task and the multi-source translation task using the WMT14 test set. Welcome to take a quick glance at our blog.

The implementation is on top of the open-source NMT toolkit THUMT.

@misc{huang2021transfer,
      title={Transfer Learning for Sequence Generation: from Single-source to Multi-source}, 
      author={Xuancheng Huang and Jingfang Xu and Maosong Sun and Yang Liu},
      year={2021},
      eprint={2105.14809},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contents

Prerequisites

  • Python >= 3.6
  • tensorflow-cpu >= 2.0
  • torch >= 1.7
  • transformers >= 3.4
  • sentencepiece >= 0.1

Pretrained model

We adopt mbart-large-cc25 in our experiments. Other sequence-to-sequence pretrained models can also be used with only a few modifications.

If your GPUs do not have enough memories, you can prune the original large vocabulary (25k) to a small vocabulary (e.g., 3k) with little performance loss.

Finetuning

Single-source finetuning

PYTHONPATH=${path_to_TRICE} \
python ${path_to_TRICE}/thumt/bin/trainer.py \
    --input ${train_src1} ${train_src2} ${train_trg} \
    --vocabulary ${vocab_joint} ${vocab_joint} \
    --validation ${dev_src1} ${dev_src2} \
    --references ${dev_ref} \
    --model transformer --half --hparam_set big \
    --output single_finetuned \
    --parameters \
fixed_batch_size=false,batch_size=820,train_steps=120000,update_cycle=5,device_list=[0,1,2,3],\
keep_checkpoint_max=2,save_checkpoint_steps=2000,\
eval_steps=2001,decode_alpha=1.0,decode_batch_size=16,keep_top_checkpoint_max=1,\
attention_dropout=0.1,relu_dropout=0.1,residual_dropout=0.1,learning_rate=5e-05,warmup_steps=4000,initial_learning_rate=5e-8,\
separate_encode=false,separate_cross_att=false,segment_embedding=false,\
input_type="single_random",adapter_type="None",num_fine_encoder_layers=0,normalization="after",\
src_lang_tok="en_XX",hyp_lang_tok="de_DE",tgt_lang_tok="de_DE",mbart_model_code="facebook/mbart-large-cc25",\
spm_path="sentence.bpe.model",pad="<pad>",bos="<s>",eos="</s>",unk="<unk>"

Multi-source finetuning

PYTHONPATH=${path_to_TRICE} \
python ${path_to_TRICE}/thumt/bin/trainer.py \
    --input ${train_src1} ${train_src2} ${train_tgt} \
    --vocabulary ${vocab_joint} ${vocab_joint} \
    --validation ${dev_src1} ${dev_src2} \
    --references ${dev_ref} \
    --model transformer --half --hparam_set big \
    --checkpoint single_finetuned/eval/model-best.pt \
    --output multi_finetuned \
    --parameters \
fixed_batch_size=false,batch_size=820,train_steps=120000,update_cycle=5,device_list=[0,1,2,3],\
keep_checkpoint_max=2,save_checkpoint_steps=2000,\
eval_steps=2001,decode_alpha=1.0,decode_batch_size=16,keep_top_checkpoint_max=1,\
attention_dropout=0.1,relu_dropout=0.1,residual_dropout=0.1,learning_rate=5e-05,warmup_steps=4000,initial_learning_rate=5e-8,special_learning_rate=5e-04,special_var_name="adapter",\
separate_encode=false,separate_cross_att=true,segment_embedding=true,\
input_type="",adapter_type="Cross-attn",num_fine_encoder_layers=1,normalization="after",\
src_lang_tok="en_XX",hyp_lang_tok="de_DE",tgt_lang_tok="de_DE",mbart_model_code="facebook/mbart-large-cc25",\
spm_path="sentence.bpe.model",pad="<pad>",bos="<s>",eos="</s>",unk="<unk>"

Arguments to be explained

** special_learning_rate: if a variable's name contains special_var_name, the learning rate of it will be special_learning_rate. We give the fine encoder a larger learning rate.
** separate_encode: whether to encode multiple sources separately before the fine encoder.
** separate_cross_att: whether to use separated cross-attention described in our paper.
** segment_embedding: whether to use sinusoidal segment embedding described in our paper.
** input_type: "single_random" for single-source finetuning , "" for multi-source finetuning.
** adapter_type: "None" for no fine encoder, "Cross-attn" for fine encoder with cross-attention.
** num_fine_encoder_layers: number of fine encoder layers.
** src_lang_tok: language token for the first source sentence. Please refer to here for language tokens for all 25 languages.
** hyp_lang_tok: language token for the second source sentence.
** tgt_lang_tok: language token for the target sentence.
** mbart_model_code: model code for transformers.
** spm_path: sentence piece model (can download from here).

Explanations for other arguments could be found in the user manual of THUMT.

Inference

PYTHONPATH=${path_to_TRICE} \
python ${path_to_TRICE}/thumt/bin/translator.py \
  --input ${test_src1} ${test_src2} --output ${test_tgt} \
  --vocabulary ${vocab_joint} ${vocab_joint} \
  --checkpoints multi_finetuned/eval/model-best.pt \
  --model transformer --half \
  --parameters device_list=[0,1,2,3],decode_alpha=1.0,decode_batch_size=32
# recover sentence piece tokenization
...
# calculate BLEU
...

Contact

If you have questions, suggestions and bug reports, please email [email protected].

Owner
THUNLP-MT
Machine Translation Group, Natural Language Processing Lab at Tsinghua University (THUNLP). Please refer to https://github.com/thunlp for more NLP resources.
THUNLP-MT
💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Rasa Open Source Rasa is an open source machine learning framework to automate text-and voice-based conversations. With Rasa, you can build contextual

Rasa 15.3k Dec 30, 2022
Korean Sentence Embedding Repository

Korean-Sentence-Embedding 🍭 Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

80 Jan 02, 2023
Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 Billion Parameters) on a single 16 GB VRAM V100 Google Cloud instance with Huggingfa

289 Jan 06, 2023
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

13.2k Jul 07, 2021
Text Classification Using LSTM

Text classification is the task of assigning a set of predefined categories to free text. Text classifiers can be used to organize, structure, and categorize pretty much anything. For example, new ar

KrishArul26 3 Jan 03, 2023
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

Microsoft 105 Jan 08, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 09, 2023
FewCLUE: 为中文NLP定制的小样本学习测评基准

FewCLUE: 为中文NLP定制的小样本学习测评基准

CLUE benchmark 387 Jan 04, 2023
Différents programmes créant une interface graphique a l'aide de Tkinter pour simplifier la vie des étudiants.

GP211-Grand-Projet Ce repertoire contient tout les programmes nécessaires au bon fonctionnement de notre projet-logiciel. Cette interface graphique es

1 Dec 21, 2021
An official repository for tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a University of Edinburgh master's course.

PMR computer tutorials on HMMs (2021-2022) This is a repository for computer tutorials of Probabilistic Modelling and Reasoning (2021/2022) - a Univer

Vaidotas Šimkus 10 Dec 06, 2022
Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

This repository contains the code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

Chenhe Dong 28 Nov 10, 2022
Natural Language Processing Specialization

Natural Language Processing Specialization In this folder, Natural Language Processing Specialization projects and notes can be found. WHAT I LEARNED

Kaan BOKE 3 Oct 06, 2022
BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

303 Dec 17, 2022
Train and use generative text models in a few lines of code.

blather Train and use generative text models in a few lines of code. To see blather in action check out the colab notebook! Installation Use the packa

Dan Carroll 16 Nov 07, 2022
German Text-To-Speech Engine using Tacotron and Griffin-Lim

jotts JoTTS is a German text-to-speech engine using tacotron and griffin-lim. The synthesizer model has been trained on my voice using Tacotron1. Due

padmalcom 6 Aug 28, 2022
🌸 fastText + Bloom embeddings for compact, full-coverage vectors with spaCy

floret: fastText + Bloom embeddings for compact, full-coverage vectors with spaCy floret is an extended version of fastText that can produce word repr

Explosion 222 Dec 16, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
Analyse japanese ebooks using MeCab to determine the difficulty level for japanese learners

japanese-ebook-analysis This aim of this project is to make analysing the contents of a japanese ebook easy and streamline the process for non-technic

Christoffer Aakre 14 Jul 23, 2022
Protein Language Model

ProteinLM We pretrain protein language model based on Megatron-LM framework, and then evaluate the pretrained model results on TAPE (Tasks Assessing P

THUDM 77 Dec 27, 2022
Deploying a Text Summarization NLP use case on Docker Container Utilizing Nvidia GPU

GPU Docker NLP Application Deployment Deploying a Text Summarization NLP use case on Docker Container Utilizing Nvidia GPU, to setup the enviroment on

Ritesh Yadav 9 Oct 14, 2022