Datasets from Instructions (DINO 🦕)

This repository contains the code for "Generating Datasets with Pretrained Language Models". The paper introduces a method called Datasets from Instructions (DINO 🦕) that enables pretrained language models to generate entire datasets from scratch.

🔧 Setup

All requirements for DINO are listed in requirements.txt. You can install them in a new environment with pip install -r requirements.txt.

💬 CLI Usage

Single Texts

To generate datasets for (single) text classification, you can use DINO as follows:

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --num_entries_per_label <N>

where <OUTPUT_DIR> is a directory to which the generated dataset is written, <TASK_FILE> is a JSON file containing a task specification (see Task Specs), and <N> is the number of examples to generate per label. To get an overview of additional parameters, run python3 dino.py --help.
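
For example, using the task specification task_specs/sts-x1.json included in this repository (see Task Specs), a call for generating raw input texts might look as follows; the output directory and the number of entries are placeholder values chosen purely for illustration:

python3 dino.py \
 --output_dir ./dino-output \
 --task_file task_specs/sts-x1.json \
 --num_entries_per_label 40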

Text Pairs

To generate datasets for text pair classification, you first need a dataset of raw input texts (which you can also generate using DINO). You can then run

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --input_file <INPUT_FILE> \
 --input_file_type <INPUT_FILE_TYPE> \
 --num_entries_per_input_and_label <N>

with <OUTPUT_DIR> and <TASK_FILE> as before. <INPUT_FILE> is the file containing the raw input texts, and <INPUT_FILE_TYPE> specifies its type, which must be one of

  • plain: for a plain text file with one input text per line
  • jsonl: for a dataset file generated by DINO in a previous step

and <N> is the number of examples to generate per label and input text.
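
Continuing the illustrative example above, the dataset generated in the first step could serve as the raw input for a text pair task such as the one specified in task_specs/sts.json. The name of the file that DINO writes into the output directory is left as a placeholder here, and all other values are again arbitrary:

python3 dino.py \
 --output_dir ./dino-sts-output \
 --task_file task_specs/sts.json \
 --input_file ./dino-output/<GENERATED_FILE> \
 --input_file_type jsonl \
 --num_entries_per_input_and_label 2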

📋 Task Specs

🚨 Before you write custom task specifications, please note that this is still a very early release and we have not yet tested DINO on tasks other than semantic textual similarity. Please let us know if you see anything strange. 🚨

To generate a dataset for a task, you need to provide a task specification file, which contains (among other things) the instructions given to the pretrained language model. A task specification is a single JSON object that looks like this:

{
  "task_name": "<TASK_NAME>",
  "labels": {
    "<LABEL_1>": {
      "instruction": "<INSTRUCTION_1>",
      "counter_labels": [<COUNTER_LABELS_1>]
    },

    ...,

    "<LABEL_n>": {
      "instruction": "<INSTRUCTION_n>",
      "counter_labels": [<COUNTER_LABELS_n>]
    }
  }
}

Here, <TASK_NAME> is the name of the task and <LABEL_1>, ..., <LABEL_n> are its labels. For each label <LABEL_i>, <INSTRUCTION_i> is the instruction provided to the language model for generating examples with label <LABEL_i> (see Writing Instructions). You can additionally specify a list of counter labels <COUNTER_LABELS_i> for each label. This tells the model to generate outputs that are not only likely given the current label, but also unlikely given all of its counter labels (see the paper for details).

Examples

You can find two examples of task specifications in /task_specs:

  • sts.json is a task specification for generating a semantic textual similarity (STS) dataset, given a set of raw input texts.
  • sts-x1.json is a task specification for generating a set of raw input texts. This set can then be used in a subsequent step to generate a full STS dataset using sts.json.

Writing Instructions

When writing instructions for a new task, you should consider the following things:

  • Always end your instructions with an (opening) quotation mark ("). This is required because it allows us to interpret the next quotation mark generated by the language model as a signal that it has finished generating an example.
  • For good results, keep the instructions as short and simple as possible, as this makes it easier for a pretrained language model to understand them.
  • If you are writing instructions for a text pair classification task, make sure that each instruction contains the placeholder <X1> exactly once. The raw input texts provided via <INPUT_FILE> are inserted at this position during generation.

An example of an instruction that prompts the model to generate a positive review for a restaurant would be:

Task: Write a review for a really great restaurant.
Review: "

An example of an instruction that prompts the model to generate a sentence that has the same meaning as another given sentence would be:

Task: Write two sentences that mean the same thing.
Sentence 1: "<X1>"
Sentence 2: "

🦕 Generated DINOs

In this section, we will soon publish a list of datasets that we have generated using DINO.

📕 Citation

If you make use of the code in this repository or of any DINO-based dataset, please cite the following paper:

@article{schick2020generating,
  title={Generating Datasets with Pretrained Language Models},
  author={Timo Schick and Hinrich Schütze},
  journal={Computing Research Repository},
  volume={arXiv:2104.07540},
  url={https://arxiv.org/abs/2104.07540},
  year={2021}
}