PyTorch source code for the NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

Overview

This repository contains the source code for the NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (paper: https://www.aclweb.org/anthology/N19-1213).

Introduction

This paper presents a simple transfer learning approach that addresses the problem of catastrophic forgetting. We pretrain a language model and then transfer it to a new model, to which we add a recurrent layer and an attention mechanism. Based on multi-task learning, we use a weighted sum of losses (language model loss and classification loss) and fine-tune the pretrained model on our (classification) task.
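
As a rough illustration of the weighted sum of losses described above, the sketch below combines a classification loss with an auxiliary LM loss in plain Python. The function names and the weight `gamma` are assumptions made for this example and are not taken from the repository's code. The exact weighting used in the experiments is set in the yaml configs and may differ.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


def combined_loss(clf_loss, lm_loss, gamma):
    """Weighted sum of the classification loss and the auxiliary LM loss.

    `gamma` is a hypothetical name for the weight on the LM term.
    """
    return clf_loss + gamma * lm_loss


# Toy example: a 2-class classifier output and an LM output over a 4-word vocabulary.
clf_loss = cross_entropy([2.0, 0.5], target=0)
lm_loss = cross_entropy([0.1, 0.2, 3.0, 0.3], target=2)
total = combined_loss(clf_loss, lm_loss, gamma=0.2)
print(f"clf={clf_loss:.4f} lm={lm_loss:.4f} total={total:.4f}")
```

Setting `gamma` to 0 recovers plain fine-tuning on the classification loss alone, which is the setting where catastrophic forgetting is most likely.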

Architecture

Step 1:

  • Pretraining of a word-level LSTM-based language model

Step 2:

  • Fine-tuning the language model (LM) on a classification task

  • Use of an auxiliary LM loss

  • Employing 2 different optimizers (1 for the pretrained part and 1 for the newly added part)

  • Sequential unfreezing of the pretrained layers
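
The last two points of Step 2 can be sketched as a simple schedule. This is a minimal illustration in plain Python, not the repository's implementation: the group names, the optimizer names, the epoch thresholds, and the order in which the pretrained parts are unfrozen (hidden layers first, then the embedding layer) are all assumptions made for the example.

```python
def trainable_groups(epoch, unfreeze_hidden_at, unfreeze_embed_at):
    """Return the parameter groups that are updated at `epoch` (1-indexed).

    The newly added layers are trained from the start. The pretrained
    hidden layers and the pretrained embedding layer are unfrozen later,
    at the (hypothetical) epochs given by the two thresholds.
    """
    groups = ["new_layers"]
    if epoch >= unfreeze_hidden_at:
        groups.append("pretrained_hidden")
    if epoch >= unfreeze_embed_at:
        groups.append("pretrained_embedding")
    return groups


# One optimizer for the newly added part, one for the pretrained part
# (both names are placeholders for this sketch).
OPTIMIZER_FOR_GROUP = {
    "new_layers": "optimizer_new",
    "pretrained_hidden": "optimizer_pretrained",
    "pretrained_embedding": "optimizer_pretrained",
}

for epoch in range(1, 5):
    groups = trainable_groups(epoch, unfreeze_hidden_at=2, unfreeze_embed_at=3)
    optimizers = sorted({OPTIMIZER_FOR_GROUP[g] for g in groups})
    print(epoch, groups, optimizers)
```

In a PyTorch model, the same schedule would typically be applied by toggling `requires_grad` on the parameters of each group before the corresponding epoch.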

Reference

@inproceedings{chronopoulou-etal-2019-embarrassingly,
    title = "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models",
    author = "Chronopoulou, Alexandra  and
      Baziotis, Christos  and
      Potamianos, Alexandros",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1213",
    pages = "2089--2095",
}

Prerequisites

Dependencies

  • PyTorch version >=0.4.0

  • Python version >= 3.6

Install Requirements

Create Environment (Optional): Ideally, you should create a conda environment for the project.

conda create -n siatl python=3
conda activate siatl

Install PyTorch 0.4.0 built for your CUDA version if you want to use the GPU:

conda install pytorch==0.4.0 torchvision -c pytorch

Then install the rest of the requirements:

pip install -r requirements.txt

Download Data

The Sarcasm Corpus V2 dataset is provided under datasets/.

Plot visualization

Visdom is used to visualize metrics during training. Start the server from the command line (for example inside tmux or screen) by running visdom. You can then view the visualizations at http://localhost:8097 in your browser.

Check here for more: https://github.com/facebookresearch/visdom#usage

Training

To train a model, either the LM or SiATL, run the corresponding Python script and pass a yaml model config as an argument. The yaml config specifies all the details of the experiment. To change a model, edit an existing yaml config file or create a new one.

The yaml config files can be found under the model_configs/ directory.

Use the pretrained Language Model:

cd checkpoints/
wget https://www.dropbox.com/s/lalizxf3qs4qd3a/lm20m_70K.pt 

(This downloads the checkpoint into the checkpoints/ directory.)

(Optional) Train a Language Model:

Assuming you have placed the training and validation data under datasets/<name_of_your_corpus>/train.txt and datasets/<name_of_your_corpus>/valid.txt (check the data section of model_configs/lm_20m_word.yaml), you can train an LM.

See for example:

python models/sent_lm.py -i lm_20m_word.yaml

Fine-tune the Language Model on the labeled dataset, using an auxiliary LM loss, 2 optimizers and sequential unfreezing, as described in the paper:

To fine-tune it on the Sarcasm Corpus V2 dataset:

python models/run_clf.py -i SCV2_aux_ft_gu.yaml --aux_loss --transfer

  • -i: Configuration yaml file (under model_configs/)
  • --aux_loss: Use an auxiliary LM loss during fine-tuning
  • --transfer: Use a pretrained LM to initialize the embedding and hidden layers of your model. If omitted, they are randomly initialized
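
For readers who want to see how flags like these are usually wired up, here is a minimal argparse sketch. It is not the parser used in models/run_clf.py: the long option name `--input` and the help strings are assumptions, and only the three flags listed above are modeled.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the three flags described above."""
    parser = argparse.ArgumentParser(description="Sketch of the fine-tuning CLI flags")
    # `--input` is a hypothetical long name; the README only documents `-i`.
    parser.add_argument("-i", "--input", required=True,
                        help="yaml config file (looked up under model_configs/)")
    parser.add_argument("--aux_loss", action="store_true",
                        help="add the auxiliary LM loss to the classification loss")
    parser.add_argument("--transfer", action="store_true",
                        help="initialize from a pretrained LM instead of randomly")
    return parser


# Parse the same arguments as the fine-tuning command above.
args = build_parser().parse_args(["-i", "SCV2_aux_ft_gu.yaml", "--aux_loss", "--transfer"])
print(args.input, args.aux_loss, args.transfer)
```

With `action="store_true"`, both switches default to False, so leaving them out corresponds to training without the auxiliary loss and with random initialization.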
Owner
Alexandra Chronopoulou
Research Intern at AllenAI. CS PhD student in LMU Munich.