NLP library designed for reproducible experimentation management

Overview

Welcome to the Transfer NLP library, a framework built on top of PyTorch to promote reproducible experimentation and Transfer Learning in NLP.

You can get an overview of the high-level API in this Colab Notebook, which shows how to use the framework on several examples. All deep learning examples in these notebooks include in-cell Tensorboard training monitoring!

For an example of pre-trained model fine-tuning, we provide a short executable tutorial on BertClassifier fine-tuning in this Colab Notebook.

Set up your environment

mkvirtualenv transfernlp
workon transfernlp

git clone https://github.com/feedly/transfer-nlp.git
cd transfer-nlp
pip install -r requirements.txt

To use Transfer NLP as a library:

# to install the experiment builder only
pip install transfernlp
# to install Transfer NLP with PyTorch and Transfer Learning in NLP support
pip install transfernlp[torch]

or

pip install git+https://github.com/feedly/transfer-nlp.git

to get the latest state before new releases.

To use Transfer NLP with associated examples:

git clone https://github.com/feedly/transfer-nlp.git
pip install -r requirements.txt

Documentation

API documentation and an overview of the library can be found here.

Reproducible Experiment Manager

The core of the library is an experiment builder: you define the different objects that your experiment needs, and the configuration loader builds them for you. For reproducible research and easy ablation studies, the library enforces the use of configuration files for experiments. As people have different tastes for what constitutes a good experiment file, the library allows for experiments defined in several formats:

  • Python Dictionary
  • JSON
  • YAML
  • TOML
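
For instance, the same experiment can equally be written as a plain Python dictionary. Here is a minimal sketch, mirroring the YAML example shown further below (component names are illustrative):

# Minimal experiment expressed as a Python dict, equivalent in spirit
# to the YAML example below:
experiment = {
    'data_loader': {
        '_name': 'MyDataLoader',
        'data_parameter': 'foo',
    },
    'model': {
        '_name': 'MyModel',
        'model_hyper_param': 100,
        'data': '$data_loader',
    },
}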

In Transfer-NLP, an experiment config file contains all the information necessary to entirely define the experiment. This is where you insert the names of the different components your experiment will use, along with their hyperparameters. Transfer-NLP makes use of the Inversion of Control pattern, which allows you to define any class / method / function you might need; the ExperimentConfig class will build a dictionary and instantiate your objects accordingly.

To use your own classes inside Transfer-NLP, you need to register them using the @register_plugin decorator. Instead of using a different registry for each kind of component (Models, Data loaders, Vectorizers, Optimizers, ...), a single registry is used here, in order to allow total customization.

If you use Transfer NLP as a dev dependency only, you might want to use it declaratively only, and call register_plugin() on the objects you want to use at experiment running time.
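
For instance, a minimal sketch of both styles (assuming register_plugin is importable from transfer_nlp.plugins.config, as in the examples further below):

from transfer_nlp.plugins.config import register_plugin

# Decorator style: register your own class so the config loader
# can instantiate it by name
@register_plugin
class MyDataLoader:
    def __init__(self, data_parameter: str, data_vectorizer):
        self.data_parameter = data_parameter
        self.data_vectorizer = data_vectorizer

# Declarative style: register a third-party class at experiment running time
# (awesome_repo is a hypothetical package)
from awesome_repo.module import AwesomeClass
register_plugin(AwesomeClass)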

Here is an example of how you can define an experiment in a YAML file:

data_loader:
  _name: MyDataLoader
  data_parameter: foo
  data_vectorizer:
    _name: MyVectorizer
    vectorizer_parameter: bar

model:
  _name: MyModel
  model_hyper_param: 100
  data: $data_loader

trainer:
  _name: MyTrainer
  model: $model
  data: $data_loader
  loss:
    _name: PyTorchLoss
  tensorboard_logs: $HOME/path/to/tensorboard/logs
  metrics:
    accuracy:
      _name: Accuracy

Any object can be defined through a class, method or function, given a _name parameter followed by its own parameters. Experiments are then loaded and instantiated using ExperimentConfig(experiment=experiment_path_or_dict).
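
For example, a minimal sketch of loading the YAML file above (assuming ExperimentConfig is importable from transfer_nlp.plugins.config and that the built experiment can be indexed like a dictionary):

from pathlib import Path
from transfer_nlp.plugins.config import ExperimentConfig

# Build every object declared in the config file; $HOME is substituted
# from the keyword argument
experiment = ExperimentConfig(experiment='experiment.yaml', HOME=str(Path.home()))
trainer = experiment['trainer']          # a MyTrainer instance
data_loader = experiment['data_loader']  # same instance the trainer received via $data_loader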

Some considerations:

  • Default parameters can be skipped in the experiment file.

  • If an object is used in different places, you can refer to it using the $ symbol; for example, here the trainer object uses the data_loader instantiated elsewhere. No ordering of objects is required.

  • For paths, you might want to use environment variables so that other machines can also run your experiments. In the previous example, you would run e.g. ExperimentConfig(experiment=yaml_path, HOME=Path.home()) to instantiate the experiment and replace $HOME with your machine's home path.

  • The config instantiation allows for arbitrarily complex settings with nested dicts / lists.

You can have a look at the tests for examples of experiment settings the config loader can build. Additionally, we provide runnable experiments in experiments/.

Transfer Learning in NLP: flexible PyTorch Trainers

For deep learning experiments, we provide a BaseIgniteTrainer in transfer_nlp.plugins.trainers.py. This basic trainer takes a model and some data as input, and runs a whole training pipeline. We make use of the PyTorch-Ignite library to monitor events during training (logging metrics, manipulating learning rates, checkpointing models, etc.). Tensorboard logs are also included as an option; you only have to specify a tensorboard_logs parameter path in the config file. Then just run tensorboard --logdir=path/to/logs in a terminal and you can monitor your experiment while it's training! Tensorboard comes with very nice utilities to keep track of the norms of your model weights, plot histograms and distributions, visualize embeddings, etc., so we really recommend using it.
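
Putting it together, a hedged sketch of launching a training run (the train() entry point on the built trainer is an assumption here, not a documented API):

from pathlib import Path
from transfer_nlp.plugins.config import ExperimentConfig

experiment = ExperimentConfig(experiment='experiment.yaml', HOME=str(Path.home()))
# Launch the whole training pipeline; train() is assumed to be the entry point
experiment['trainer'].train()
# In a separate terminal: tensorboard --logdir=$HOME/path/to/tensorboard/logs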

We provide a SingleTaskTrainer class which you can use in any supervised setting dealing with a single task. We are working on a MultiTaskTrainer class for multi-task settings, and a SingleTaskFineTuner for large-model fine-tuning settings.

Use cases

Here are a few use cases for Transfer NLP:

  • You have all your classes / methods / functions ready. Transfer NLP allows for a clean way to centralize loading and executing your experiments
  • You have all your classes but you would like to benchmark multiple configuration settings: the ExperimentRunner class allows for sequentially running your sets of experiments and generates personalized reporting (you only need to implement your report method in a custom ReporterABC class; see the sketch after this list)
  • You want to experiment with training deep learning models but you feel overwhelmed by all the boilerplate code in SOTA models' GitHub projects. Transfer NLP encourages separation of important objects so that you can focus on the PyTorch Module implementation and let the trainers deal with the training part (while still controlling most of the training parameters through the experiment file)
  • You want to experiment with more advanced training strategies, but you are more interested in the ideas than in the implementation details. We are working on improving the advanced trainers so that it will be easier to try new ideas for multi-task settings, fine-tuning strategies or model adaptation schemes.
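
As a sketch of the benchmarking use case above (the ReporterABC import path and the report signature are assumptions; the run_all parameters follow the signature quoted in the comments further below):

from pathlib import Path
from transfer_nlp.plugins.config import register_plugin
from transfer_nlp.plugins.reporters import ReporterABC  # import path is an assumption
from transfer_nlp.runner.experiment_runner import ExperimentRunner

@register_plugin
class MyReporter(ReporterABC):
    def report(self, name, experiment, report_dir):  # exact signature is an assumption
        # Persist whatever metrics / artifacts you care about under report_dir
        ...

# Sequentially run every parameter set defined in configs.json (hypothetical file names)
ExperimentRunner.run_all(experiment='experiment.yaml',
                         experiment_config='configs.json',
                         report_dir='reports',
                         trainer_config_name='trainer',
                         reporter_config_name='reporter',
                         HOME=str(Path.home()))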

Slack integration

While experimenting with your own models / data, training might take some time. To get notified when your training finishes or crashes, you can use the simple knockknock library by the folks at HuggingFace, which adds a simple decorator to your running function to notify you via Slack, email, etc.
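
For example, with knockknock's Slack notifier (the webhook URL and channel below are placeholders):

from knockknock import slack_sender
from transfer_nlp.plugins.config import ExperimentConfig

@slack_sender(webhook_url='https://hooks.slack.com/services/...', channel='experiments')
def run_experiment():
    experiment = ExperimentConfig(experiment='experiment.yaml')
    experiment['trainer'].train()  # train() entry point assumed, as above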

Some objectives to reach:

  • Include examples using state of the art pre-trained models
  • Incorporate linguistic properties into models
  • Experiment with RL for sequential tasks
  • Include probing tasks to try to understand the properties that are learned by the models

Acknowledgment

The library was inspired by the reading of "Natural Language Processing with PyTorch" by Delip Rao and Brian McMahan. The experiments in experiments/, the Vocabulary building block, and the embeddings nearest neighbors are taken or adapted from the code provided in the book.

Comments
  • Pytorch Lightning as a back-end


Hi! Check out Pytorch Lightning as an option for your backend! We're looking for awesome projects implemented in Lightning.

https://github.com/williamFalcon/pytorch-lightning

    opened by williamFalcon 3
  • have the possibility to build object with a function instead of a class

    When you want to experiment with someone else's code, you don't want to copy-paste their code.

    If you want to use a class AwesomeClass from an awesome github repo, you can do:

from transfer_nlp.plugins.config import register_plugin
    from awesome_repo.module import AwesomeClass
    
    register_plugin(AwesomeClass)
    

    and then use it in your experiments.

However, when reusing complex objects, it might be complicated to configure them. An example is the pre-trained model from the pytorch-pretrained-bert repo, where you can build complex models with nice one-liners such as model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

It's possible to encapsulate these into other classes and have Transfer NLP build them, but it can feel awkward and adds unnecessary complexity / lines of code compared to the initial one-liner. An alternative is to build these objects with a function; in the previous example we would only write:

    @register_function
    def bert_classifier(bert_version: str='bert-base-uncased', num_labels: int=4):
        return BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path=bert_version, num_labels=num_labels)
    

and we could use functions just as we use classes in the config loading.

    opened by petermartigny 2
  • caching objects in experiment runner

Some read-only objects can take a while to load in experiments (embeddings, datasets, etc.). The current ExperimentRunner always recreates the entire experiment. It would be nice if we could keep some objects in memory...

    Proposal

Add an experiment_cache parameter to run_all:

        def run_all(experiment: Union[str, Path, Dict],
                    experiment_cache: Union[str, Path, Dict],
                    experiment_config: Union[str, Path],
                    report_dir: Union[str, Path],
                    trainer_config_name: str = 'trainer',
                    reporter_config_name: str = 'reporter',
                    **env_vars) -> None:
    

The cache is just another experiment json. It would be loaded only once, at the very beginning, using only the env_vars. Any resulting objects would then be added to env_vars when running each experiment. Objects can optionally implement a Resettable class that has a reset method that would be called once before each experiment.

Incorrect usage of this feature could lead to non-reproducibility issues, but through docs we could make it clear this should only be used for read-only objects. I think it would be worth doing...

    opened by kireet 1
  • cleanup config tests, also fixes #28

I wanted to make the config tests a bit more sane, minimize the number of temporary classes we needed to create, and improve naming. Also found issue #28 and fixed it.

    opened by kireet 1
  • unsubstituted parameter doesn't cause an error

    something like this won't cause a problem:

    { 
       "item": {
           "_name": "foo",
           "param":"$bar"
        }
    }
    

even if we don't set a value for bar. This can lead to easily misconfigured objects.

    opened by kireet 1
  • Ioc refactor

    • Refactor the basic trainer into an IoC pattern, with a single registry for all registrable classes, allowing for maximum customization
    • Separate the example experiments from the library
    • Adapt the examples to the new logic
    • Set cuda as optional in the config file
    opened by petermartigny 1
  • TPU + 16 bit

    hey!

    Not sure if you've seen: https://github.com/williamFalcon/pytorch-lightning.

    The fastest growing PyTorch front-end project.

    We're also now venture funded so we have a fulltime team working on this and will be around for a very long time :)

    https://medium.com/pytorch/pytorch-lightning-0-7-1-release-and-venture-funding-dd12b2e75fb3?postPublishedType=repub

    opened by williamFalcon 0
  • Optional torch imports for trainers

    We import torch modules in the __init__.py of trainers. This PR makes these imports optional, in the case where we don't have torch installed but still want to use the base TrainerABC class

    opened by petermartigny 0
  • move trainerABC to separate file

This PR moves the TrainerABC class to a separate file, so that someone who wants to use the experiment runner class can do so without having to install torch.

    opened by petermartigny 0
  • Refactor/experiment config

    This PR does the refactoring defined in #76 to have a more easily maintainable configuration logic.

Also, we remove the pytorch modules that were included in the registry by default. This allows non-DL projects to use the config part of the library.

    opened by petermartigny 0
  • simplify configs reporting

    This PR does a few things:

    • Get rid of saving ini .cfg files
    • Before doing the sequential experiments, we copy the configs, experiment and cache files to a global-reporting directory.
    • This global-reporting directory will also host the outputs from the reporter's report_globally() call
    opened by petermartigny 0
  • [ExperimentRunner] Default value of experiment_cache causes run_all to fail

ExperimentRunner.run_all fails if experiment_cache is None.

    The issue comes from line 109, where the default value for the experiment cache (None) is not handled correctly: https://github.com/feedly/transfer-nlp/blob/master/transfer_nlp/runner/experiment_runner.py#L109

    opened by Mathieu4141 0
  • Check that all registrables are registered

    Currently, objects are built one by one and when one fails it throws an error.

It would be great to have a quick pass before instantiating objects to check that all registrable names / aliases are actually registered, and throw an error at that point.

    opened by petermartigny 0
  • Downloader Plugin

From the talk today, one good point was that reproducibility problems often stem from data inconsistencies. To that end, I think we should have a DataDownloader component that can download data from URLs and save it locally to disk.

    • If the files exist, the downloader can skip the download
    • The downloader should calculate checksums for downloaded files. It should produce a checksums.cfg file to simplify reusing these in configuration later
    • The downloader should allow checksums to be configured in the experiment file. When set, the downloader would verify the downloaded file is the same as the one specified in the experiment.

    So an example json config could be:

    {
      "_name": "Downloader",
      "local_dir": "$my_path",
      "checksums": "$WORK_DIR/checksums_2019_05_23.cfg", <-- produced by a previous download 
      "sentences.txt.gz": {
        "url": "$BASE_URL/sentences.txt.gz",
        "decompress": true
      },
      "word_embeddings.npy": {
        "url": "$BASE_URL/word_embeddings.npy"
      }
    }
    
    opened by kireet 1
Releases (v0.1.6)
  • v0.1.5(Jun 25, 2019)

  • v0.1.3(May 29, 2019)

  • v0.1.2(May 28, 2019)

  • v0.1.1(May 28, 2019)

  • v0.1(May 28, 2019)

    This is a first stable version for Transfer NLP, allowing users to:

    • Keep track of experiments and enforce reproducible research
    • Combine custom and open-source code into controlled experiments

    Here are a few features available in the release:

    • Configuring all objects from an experiment using a json file
    • Running sequential jobs for the same experiment using different sets of parameters (parameter tuning, ablation studies...)
    • Keeping track of your experiments and making them reproducible / incrementally improvable
    • Allowing dynamic re-creation of any instantiated object during training through object factories
    • Several basic building blocks: Vocabulary class, PyTorch optimizer, Predictors...
    • Transfer Learning: use the BasicTrainer to fine-tune pre-trained models to your custom downstream tasks
