This repository contains the code for "Generating Datasets with Pretrained Language Models".

Related tags

Text Data & NLPdino
Overview

Datasets from Instructions (DINO 🦕 )

This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces a method called Datasets from Instructions (DINO 🦕 ) that enables pretrained language models to generate entire datasets from scratch.

🔧 Setup

All requirements for DINO can be found in requirements.txt. You can install all required packages in a new environment with pip install -r requirements.txt.

💬 CLI Usage

Single Texts

To generate datasets for (single) text classification, you can use DINO as follows:

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --num_entries_per_label <N>

where <OUTPUT_DIR> is a directory to which the generated dataset is written, <TASK_FILE> is a JSON file containing a task specification (see Task Specs), and <N> is the number of examples to generate per label. To get an overview of additional parameters, run python3 dino.py --help.

Text Pairs

To generate datasets for text pair classification, you first need a dataset of raw input texts (which you can also generate using DINO). You can then run

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --input_file <INPUT_FILE> \
 --input_file_type <INPUT_FILE_TYPE> \
 --num_entries_per_input_and_label <N>

with <OUTPUT_DIR> and <TASK_FILE> as before. <INPUT_FILE> refers to the file containing raw input texts, <INPUT_FILE_TYPE> specifies its type, which should be one of

  • plain: for a plain text file with one input text per line
  • jsonl: for a dataset file generated by DINO in a previous step

and <N> is the number of examples to generate per label and input text.

📋 Task Specs

🚨 Before you write custom task specifications, please note that this is still a very early release and we have not tested DINO on other tasks than semantic textual similarity yet. Please let us know if you see something strange. 🚨

To generate a dataset for a task, you need to provide a file containing a task specification, containing (among other things) the instructions given to the pretrained language model. A task specification is a single JSON object that looks like this:

{
  "task_name": "<TASK_NAME>",
  "labels": {
    "<LABEL_1>": {
      "instruction": "<INSTRUCTION_1>",
      "counter_labels": [<COUNTER_LABELS_1>]
    },

    ...,

    "<LABEL_n>": {
      "instruction": "<INSTRUCTION_n>",
      "counter_labels": [<COUNTER_LABELS_n>]
    }
  }
}

Here, <TASK_NAME> is the name for the task and <LABEL_1>, ..., <LABEL_n> are the task's labels. For each label <LABEL_i>, <INSTRUCTION_i> is the instruction provided to the language model for generating examples with label <LABEL_i> (see Writing Instructions). You can additionally specify a list of counter labels <COUNTER_LABELS_n> for each label. This tells the model to generate outputs that are not only likely given the current label, but also unlikely given all counter labels (see the paper for details).

Examples

You can find two examples of task specifications in /task_specs:

  • sts.json is a task specification for generating a semantic textual similarity dataset if a set of raw input texts is already given.
  • sts-x1.json is a task specification for generating a set of raw input texts. This set can then be used in a subsequent step to generate a full STS dataset using sts.json.

Writing Instructions

When writing instructions for a new task, you should consider the following things:

  • Always end your instructions with an (opening) quotation mark ("). This is required because it allows us to interpret the next quotation mark generated by the language model as a signal that it is done generating an example.
  • For good results, keep the instructions as short and simple as possible as this makes it easier for a pretrained language model to understand them.
  • If you are writing instructions for a text pair classification task, make sure that each instruction contains the placeholder <X1> exactly once. At this position, the provided raw input sentences are inserted during generation.

An example for an instruction that prompts the model to generate a positive review for a restaurant would be:

Task: Write a review for a really great restaurant.
Review: "

An example for an instruction that prompts the model to generate a sentence that has the same meaning as another given sentence would be:

Task: Write two sentences that mean the same thing.
Sentence 1: "<X1>"
Sentence 2: "

🦕 Generated DINOs

In this section, we will soon make publicly available a list of datasets that we have generated using DINO.

📕 Citation

If you make use of the code in this repository or of any DINO-based dataset, please cite the following paper:

@article{schick2020generating,
  title={Generating Datasets with Pretrained Language Models},
  author={Timo Schick and Hinrich Schütze},
  journal={Computing Research Repository},
  volume={arXiv:2104.07540},
  url={https://arxiv.org/abs/2104.07540},
  year={2021}
}
Owner
Timo Schick
NLP Researcher @ SulzerGmbH , PhD Student @ CIS, LMU Munich
Timo Schick
The first online catalogue for Arabic NLP datasets.

Masader The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each datas

ARBML 94 Dec 26, 2022
Code for the project carried out fulfilling the course requirements for Fall 2021 NLP at NYU

Introduction Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization,

Sai Himal Allu 1 Apr 25, 2022
Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

1.1k Dec 27, 2022
YACLC - Yet Another Chinese Learner Corpus

汉语学习者文本多维标注数据集YACLC V1.0 中文 | English 汉语学习者文本多维标注数据集(Yet Another Chinese Learner

BLCU-ICALL 47 Dec 15, 2022
Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

Jan 2 Apr 20, 2022
Chinese segmentation library

What is loso? loso is a Chinese segmentation system written in Python. It was developed by Victor Lin ( Fang-Pen Lin 82 Jun 28, 2022

Unsupervised Language Model Pre-training for French

FlauBERT and FLUE FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the n

GETALP 212 Dec 10, 2022
A repo for materials relating to the tutorial of CS-332 NLP

CS-332-NLP A repo for materials relating to the tutorial of CS-332 NLP Contents Tutorial 1: Introduction Corpus Regular expression Tokenization Tutori

Alok singh 9 Feb 15, 2022
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Facebook Research 409 Oct 28, 2022
Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script fo

Iker García-Ferrero 41 Dec 15, 2022
A simple visual front end to the Maya UE4 RBF plugin delivered with MetaHumans

poseWrangler Overview PoseWrangler is a simple UI to create and edit pose-driven relationships in Maya using the MayaUE4RBF plugin. This plugin is dis

Christopher Evans 105 Dec 18, 2022
Unofficial Python library for using the Polish Wordnet (plWordNet / Słowosieć)

Polish Wordnet Python library Simple, easy-to-use and reasonably fast library for using the Słowosieć (also known as PlWordNet) - a lexico-semantic da

Max Adamski 12 Dec 23, 2022
Utilities for preprocessing text for deep learning with Keras

Note: This utility is really old and is no longer maintained. You should use keras.layers.TextVectorization instead of this. Utilities for pre-process

Hamel Husain 180 Dec 09, 2022
Kerberoast with ACL abuse capabilities

targetedKerberoast targetedKerberoast is a Python script that can, like many others (e.g. GetUserSPNs.py), print "kerberoast" hashes for user accounts

Shutdown 213 Dec 22, 2022
使用pytorch+transformers复现了SimCSE论文中的有监督训练和无监督训练方法

SimCSE复现 项目描述 SimCSE是一种简单但是很巧妙的NLP对比学习方法,创新性地引入Dropout的方式,对样本添加噪声,从而达到对正样本增强的目的。 该框架的训练目的为:对于batch中的每个样本,拉近其与正样本之间的距离,拉远其与负样本之间的距离,使得模型能够在大规模无监督语料(也可以

58 Dec 20, 2022
Grading tools for Advanced NLP (11-711)Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 2 Sep 27, 2022
BiNE: Bipartite Network Embedding

BiNE: Bipartite Network Embedding This repository contains the demo code of the paper: BiNE: Bipartite Network Embedding. Ming Gao, Leihui Chen, Xiang

leihuichen 214 Nov 24, 2022
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

Daniel 0 Dec 27, 2021
DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task

DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task。涵盖68个领域、共计916万词的专业词典知识库,可用于文本分类、知识增强、领域词汇库扩充等自然语言处理应用。

liuhuanyong 357 Dec 24, 2022
BERN2: an advanced neural biomedical namedentity recognition and normalization tool

BERN2 We present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by

DMIS Laboratory - Korea University 99 Jan 06, 2023