This repository contains the code for "Generating Datasets with Pretrained Language Models".

Related tags

Text Data & NLPdino
Overview

Datasets from Instructions (DINO 🦕 )

This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces a method called Datasets from Instructions (DINO 🦕 ) that enables pretrained language models to generate entire datasets from scratch.

🔧 Setup

All requirements for DINO can be found in requirements.txt. You can install all required packages in a new environment with pip install -r requirements.txt.

💬 CLI Usage

Single Texts

To generate datasets for (single) text classification, you can use DINO as follows:

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --num_entries_per_label <N>

where <OUTPUT_DIR> is a directory to which the generated dataset is written, <TASK_FILE> is a JSON file containing a task specification (see Task Specs), and <N> is the number of examples to generate per label. To get an overview of additional parameters, run python3 dino.py --help.

Text Pairs

To generate datasets for text pair classification, you first need a dataset of raw input texts (which you can also generate using DINO). You can then run

python3 dino.py \
 --output_dir <OUTPUT_DIR> \
 --task_file <TASK_FILE> \
 --input_file <INPUT_FILE> \
 --input_file_type <INPUT_FILE_TYPE> \
 --num_entries_per_input_and_label <N>

with <OUTPUT_DIR> and <TASK_FILE> as before. <INPUT_FILE> refers to the file containing raw input texts, <INPUT_FILE_TYPE> specifies its type, which should be one of

  • plain: for a plain text file with one input text per line
  • jsonl: for a dataset file generated by DINO in a previous step

and <N> is the number of examples to generate per label and input text.

📋 Task Specs

🚨 Before you write custom task specifications, please note that this is still a very early release and we have not tested DINO on other tasks than semantic textual similarity yet. Please let us know if you see something strange. 🚨

To generate a dataset for a task, you need to provide a file containing a task specification, containing (among other things) the instructions given to the pretrained language model. A task specification is a single JSON object that looks like this:

{
  "task_name": "<TASK_NAME>",
  "labels": {
    "<LABEL_1>": {
      "instruction": "<INSTRUCTION_1>",
      "counter_labels": [<COUNTER_LABELS_1>]
    },

    ...,

    "<LABEL_n>": {
      "instruction": "<INSTRUCTION_n>",
      "counter_labels": [<COUNTER_LABELS_n>]
    }
  }
}

Here, <TASK_NAME> is the name for the task and <LABEL_1>, ..., <LABEL_n> are the task's labels. For each label <LABEL_i>, <INSTRUCTION_i> is the instruction provided to the language model for generating examples with label <LABEL_i> (see Writing Instructions). You can additionally specify a list of counter labels <COUNTER_LABELS_n> for each label. This tells the model to generate outputs that are not only likely given the current label, but also unlikely given all counter labels (see the paper for details).

Examples

You can find two examples of task specifications in /task_specs:

  • sts.json is a task specification for generating a semantic textual similarity dataset if a set of raw input texts is already given.
  • sts-x1.json is a task specification for generating a set of raw input texts. This set can then be used in a subsequent step to generate a full STS dataset using sts.json.

Writing Instructions

When writing instructions for a new task, you should consider the following things:

  • Always end your instructions with an (opening) quotation mark ("). This is required because it allows us to interpret the next quotation mark generated by the language model as a signal that it is done generating an example.
  • For good results, keep the instructions as short and simple as possible as this makes it easier for a pretrained language model to understand them.
  • If you are writing instructions for a text pair classification task, make sure that each instruction contains the placeholder <X1> exactly once. At this position, the provided raw input sentences are inserted during generation.

An example for an instruction that prompts the model to generate a positive review for a restaurant would be:

Task: Write a review for a really great restaurant.
Review: "

An example for an instruction that prompts the model to generate a sentence that has the same meaning as another given sentence would be:

Task: Write two sentences that mean the same thing.
Sentence 1: "<X1>"
Sentence 2: "

🦕 Generated DINOs

In this section, we will soon make publicly available a list of datasets that we have generated using DINO.

📕 Citation

If you make use of the code in this repository or of any DINO-based dataset, please cite the following paper:

@article{schick2020generating,
  title={Generating Datasets with Pretrained Language Models},
  author={Timo Schick and Hinrich Schütze},
  journal={Computing Research Repository},
  volume={arXiv:2104.07540},
  url={https://arxiv.org/abs/2104.07540},
  year={2021}
}
Owner
Timo Schick
NLP Researcher @ SulzerGmbH , PhD Student @ CIS, LMU Munich
Timo Schick
PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation

SITT The repo contains official PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation. Authors: Boyi Li Yin Cui T

Boyi Li 52 Jan 05, 2023
Python utility library for compositing PDF documents with reportlab.

pdfdoc-py Python utility library for compositing PDF documents with reportlab. Installation The pdfdoc-py package can be installed directly from the s

Michael Gale 1 Jan 06, 2022
pytorch implementation of Attention is all you need

A Pytorch Implementation of the Transformer: Attention Is All You Need Our implementation is largely based on Tensorflow implementation Requirements N

230 Dec 07, 2022
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipB

Jie Lei 雷杰 612 Jan 04, 2023
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 08, 2023
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
A music comments dataset, containing 39,051 comments for 27,384 songs.

Music Comments Dataset A music comments dataset, containing 39,051 comments for 27,384 songs. For academic research use only. Introduction This datase

Zhang Yixiao 2 Jan 10, 2022
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022
Shirt Bot is a discord bot which uses GPT-3 to generate text

SHIRT BOT · Shirt Bot is a discord bot which uses GPT-3 to generate text. Made by Cyclcrclicly#3420 (474183744685604865) on Discord. Support Server EX

31 Oct 31, 2022
Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 🐦 🇮🇩 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

Li-am K. 0 Oct 25, 2022
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
Rhyme with AI

Local development Create a conda virtual environment and activate it: conda env create --file environment.yml conda activate rhyme-with-ai Install the

GoDataDriven 28 Nov 21, 2022
An official implementation for "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval"

The implementation of paper CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. CLIP4Clip is a video-text retrieval model based

ArrowLuo 456 Jan 06, 2023
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency

Simplemma: a simple multilingual lemmatizer for Python Purpose Lemmatization is the process of grouping together the inflected forms of a word so they

Adrien Barbaresi 70 Dec 29, 2022
SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Introduction This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper. Chen, Jia, et al. "Axiomatically Re

Jia Chen 17 Nov 09, 2022
Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models

Parrot is a paraphrase based utterance augmentation framework purpose built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

Prithivida 681 Jan 01, 2023
Text-Based zombie apocalyptic decision-making game in Python

Inspiration We shared university first year game coursework.[to gauge previous experience and start brainstorming] Adapted a particular nuclear fallou

Amin Sabbagh 2 Feb 17, 2022
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022