Pattern-Exploiting Training (PET)

This repository contains the code for Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference and It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. The papers introduce pattern-exploiting training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases. In low-resource settings, PET and iPET significantly outperform regular supervised training, various semi-supervised baselines and even GPT-3 despite requiring 99.9% fewer parameters. The iterative variant of PET (iPET) trains multiple generations of models and can even be used without any training data.

| #Examples | Training Mode | Yelp (Full) | AG's News | Yahoo Questions | MNLI |
|-----------|---------------|-------------|-----------|-----------------|------|
| 0         | unsupervised  | 33.8        | 69.5      | 44.0            | 39.1 |
| 0         | iPET          | 56.7        | 87.5      | 70.7            | 53.6 |
| 100       | supervised    | 53.0        | 86.0      | 62.9            | 47.9 |
| 100       | PET           | 61.9        | 88.3      | 69.2            | 74.7 |
| 100       | iPET          | 62.9        | 89.6      | 71.2            | 78.4 |

Note: To exactly reproduce the above results, make sure to use v1.1.0 (--branch v1.1.0).

📑 Contents

🔧 Setup

💬 CLI Usage

💻 API Usage

🐶 Train your own PET

📕 Citation

🔧 Setup

All requirements for PET can be found in requirements.txt. You can install all required packages with pip install -r requirements.txt.
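For example, to get a setup that exactly matches the results table above (see the note about v1.1.0), you can clone the corresponding tag of the official repository and install the requirements:

git clone --branch v1.1.0 https://github.com/timoschick/pet
cd pet
pip install -r requirements.txt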

💬 CLI Usage

The command line interface cli.py in this repository currently supports three different training modes (PET, iPET, supervised training), two additional evaluation methods (unsupervised and priming) and 13 different tasks. For Yelp Reviews, AG's News, Yahoo Questions, MNLI and X-Stance, see the original paper for further details. For the 8 SuperGLUE tasks, see It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners.

PET Training and Evaluation

To train and evaluate a PET model for one of the supported tasks, simply run the following command:

python3 cli.py \
--method pet \
--pattern_ids $PATTERN_IDS \
--data_dir $DATA_DIR \
--model_type $MODEL_TYPE \
--model_name_or_path $MODEL_NAME_OR_PATH \
--task_name $TASK \
--output_dir $OUTPUT_DIR \
--do_train \
--do_eval

where

  • $PATTERN_IDS specifies the PVPs to use. For example, if you want to use all patterns, specify --pattern_ids 0 1 2 3 4 for AG's News and Yahoo Questions or --pattern_ids 0 1 2 3 for Yelp Reviews and MNLI.
  • $DATA_DIR is the directory containing the train and test files (check tasks.py to see how these files should be named and formatted for each task).
  • $MODEL_TYPE is the type of model being used, e.g. albert, bert or roberta.
  • $MODEL_NAME_OR_PATH is the name of a pretrained model (e.g., roberta-large or albert-xxlarge-v2) or the path to a pretrained model.
  • $TASK is the name of the task to train and evaluate on.
  • $OUTPUT_DIR is the name of the directory in which the trained model and evaluation results are saved.
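
For illustration, a complete invocation for AG's News might look as follows (a sketch: the data and output paths are placeholders, and agnews is assumed to be the task identifier registered in tasks.py):

python3 cli.py \
--method pet \
--pattern_ids 0 1 2 3 4 \
--data_dir ./data/agnews \
--model_type roberta \
--model_name_or_path roberta-large \
--task_name agnews \
--output_dir ./output/agnews-pet \
--do_train \
--do_eval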

You can additionally specify various training parameters for both the ensemble of PET models corresponding to individual PVPs (prefix --pet_) and for the final sequence classification model (prefix --sc_). For example, the default parameters used for our SuperGLUE evaluation are:

--pet_per_gpu_eval_batch_size 8 \
--pet_per_gpu_train_batch_size 2 \
--pet_gradient_accumulation_steps 8 \
--pet_max_steps 250 \
--pet_max_seq_length 256 \
--pet_repetitions 3 \
--sc_per_gpu_train_batch_size 2 \
--sc_per_gpu_unlabeled_batch_size 2 \
--sc_gradient_accumulation_steps 8 \
--sc_max_steps 5000 \
--sc_max_seq_length 256 \
--sc_repetitions 1

For each pattern $P and repetition $I, running the above command creates a directory $OUTPUT_DIR/p$P-i$I that contains the following files:

  • pytorch_model.bin: the finetuned model, possibly along with some model-specific files (e.g., spiece.model, special_tokens_map.json)
  • wrapper_config.json: the configuration of the model being used
  • train_config.json: the configuration used for training
  • eval_config.json: the configuration used for evaluation
  • logits.txt: the model's predictions on the unlabeled data
  • eval_logits.txt: the model's predictions on the evaluation data
  • results.json: a json file containing results such as the model's final accuracy
  • predictions.jsonl: a prediction file for the evaluation set in the SuperGLUE format

The final (distilled) model for each repetition $I can be found in $OUTPUT_DIR/final/p0-i$I, which contains the same files as described above.

🚨 If your GPU runs out of memory during training, you can try decreasing both --pet_per_gpu_train_batch_size and --sc_per_gpu_unlabeled_batch_size while increasing both --pet_gradient_accumulation_steps and --sc_gradient_accumulation_steps.
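
For example, starting from the SuperGLUE defaults above, halving each batch size and doubling the corresponding accumulation steps keeps the effective batch size (2 × 8 = 16) unchanged:

--pet_per_gpu_train_batch_size 1 \
--pet_gradient_accumulation_steps 16 \
--sc_per_gpu_unlabeled_batch_size 1 \
--sc_gradient_accumulation_steps 16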

iPET Training and Evaluation

To train and evaluate an iPET model for one of the supported tasks, simply run the same command as above, but replace --method pet with --method ipet. There are various additional iPET parameters that you can modify; all of them are prefixed with --ipet_.

For each generation $G, pattern $P and iteration $I, this creates a directory $OUTPUT_DIR/g$G/p$P-i$I that is structured as for regular PET. The final (distilled) model can again be found in $OUTPUT_DIR/final/p0-i$I.

🚨 If you use iPET with zero training examples, you need to specify how many examples for each label should be chosen in the first generation and you need to change the reduction strategy to mean: --ipet_n_most_likely 100 --reduction mean.

Supervised Training and Evaluation

To train and evaluate a regular sequence classifier in a supervised fashion, simply run the same command as above, but replace --method pet with --method sequence_classifier. There are various additional parameters for the sequence classifier that you can modify; all of them are prefixed with --sc_.

Unsupervised Evaluation

To evaluate a pretrained language model with the default PET patterns and verbalizers, but without fine-tuning, remove the argument --do_train and add --no_distillation so that no final distillation is performed.
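
Concretely, an unsupervised evaluation run is the PET command from above with those two changes applied (all placeholders as before):

python3 cli.py \
--method pet \
--pattern_ids $PATTERN_IDS \
--data_dir $DATA_DIR \
--model_type $MODEL_TYPE \
--model_name_or_path $MODEL_NAME_OR_PATH \
--task_name $TASK \
--output_dir $OUTPUT_DIR \
--do_eval \
--no_distillation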

Priming

If you want to use priming, remove the argument --do_train and add the arguments --priming --no_distillation so that all training examples are used for priming and no final distillation is performed.

🚨 Remember that you may need to increase the maximum sequence length to a much larger value, e.g. --pet_max_seq_length 5000. This only works with language models that support such long sequences, e.g. XLNet. For using XLNet, you can specify --model_type xlnet --model_name_or_path xlnet-large-cased --wrapper_type plm.
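
Putting these flags together, a priming run with XLNet might look as follows (a sketch that combines the options described above; placeholders as in the PET command):

python3 cli.py \
--method pet \
--pattern_ids $PATTERN_IDS \
--data_dir $DATA_DIR \
--model_type xlnet \
--model_name_or_path xlnet-large-cased \
--wrapper_type plm \
--task_name $TASK \
--output_dir $OUTPUT_DIR \
--do_eval \
--priming \
--no_distillation \
--pet_max_seq_length 5000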

💻 API Usage

Instead of using the command line interface, you can also use the PET API directly; most of it is defined in pet.modeling. After running import pet, you can access methods such as train_pet, train_ipet and train_classifier. Check out their documentation for more information.
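
For example, from an interactive session you can inspect the documented signatures directly (a minimal sketch; the configuration objects these methods expect are defined in pet.modeling):

import pet

# train_pet, train_ipet and train_classifier are the documented entry points;
# their exact parameters are described in their docstrings
help(pet.train_pet)
help(pet.train_ipet)
help(pet.train_classifier)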

🐶 Train your own PET

To use PET for custom tasks, you need to define two things:

  • a DataProcessor, responsible for loading training and test data. See examples/custom_task_processor.py for an example.
  • a PVP, responsible for applying patterns to inputs and mapping labels to natural language verbalizations. See examples/custom_task_pvp.py for an example.

After having implemented the DataProcessor and the PVP, you can train a PET model using the command line as described above. Below, you can find additional information on how to define the two components of a PVP, verbalizers and patterns.

Verbalizers

Verbalizers are used to map task labels to words in natural language. For example, in a binary sentiment classification task, you could map the positive label (+1) to the word good and the negative label (-1) to the word bad. Verbalizers are realized through a PVP's verbalize() method. The simplest way of defining a verbalizer is to use a dictionary:

from typing import List  # required for the List annotation

VERBALIZER = {"+1": ["good"], "-1": ["bad"]}

def verbalize(self, label) -> List[str]:
    return self.VERBALIZER[label]

Importantly, in PET's current version, verbalizers are by default restricted to single tokens in the underlying LM's vocabulary (for using more than one token, see below). Given a language model's tokenizer, you can easily check whether a word corresponds to a single token by verifying that len(tokenizer.tokenize(word)) == 1.
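
As a quick sketch of that check (the model name roberta-large is only an example):

# verify that each candidate verbalization is a single token in the LM's vocabulary
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
for word in ["good", "bad"]:
    tokens = tokenizer.tokenize(word)
    print(word, "->", tokens, "(single token)" if len(tokens) == 1 else "(needs multiple masks, see below)")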

You can also define multiple verbalizations for a single label. For example, if you are unsure which words best represent the labels in a binary sentiment classification task, you could define your verbalizer as follows:

VERBALIZER = {"+1": ["great", "good", "wonderful", "perfect"], "-1": ["bad", "terrible", "horrible"]}

Patterns

Patterns are used to make the language model understand a given task; they must contain exactly one <MASK> token which is to be filled using the verbalizer. For binary sentiment classification based on a review's summary (<A>) and body (<B>), a suitable pattern may be <A>. <B>. Overall, it was <MASK>. Patterns are realized through a PVP's get_parts() method, which returns a pair of text sequences (where each sequence is represented by a list of strings):

def get_parts(self, example: InputExample):
    return [example.text_a, '.', example.text_b, '.'], ['Overall, it was ', self.mask]

If you do not want to use a pair of sequences, you can simply leave the second sequence empty:

def get_parts(self, example: InputExample):
    return [example.text_a, '.', example.text_b, '. Overall, it was ', self.mask], []

If you want to define several patterns, simply use the PVP's pattern_id attribute:

def get_parts(self, example: InputExample):
    if self.pattern_id == 1:
        return [example.text_a, '.', example.text_b, '.'], ['Overall, it was ', self.mask]
    elif self.pattern_id == 2:
        return ['It was just ', self.mask, '!', example.text_a, '.', example.text_b, '.'], []

When training the model using the command line, specify all patterns to be used (e.g., --pattern_ids 1 2).

Importantly, if a sequence is longer than the specified maximum sequence length of the underlying LM, PET must know which parts of the input can be shortened and which ones cannot (for example, the mask token must always be there). Therefore, PVP provides a shortenable() method to indicate that a piece of text can be shortened:

def get_parts(self, example: InputExample):
    text_a = self.shortenable(example.text_a)
    text_b = self.shortenable(example.text_b)
    return [text_a, '.', text_b, '. Overall, it was ', self.mask], []

PET with Multiple Masks

By default, the current implementation of PET and iPET only supports a fixed set of labels that is shared across all examples and verbalizers that correspond to a single token. However, for some tasks it may be necessary to use verbalizers that correspond to multiple tokens (as described in It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners). To do so, you simply need the following two modifications:

  1. Add the following lines in your task's DataProcessor (see examples/custom_task_processor.py):

    from pet.tasks import TASK_HELPERS
    from pet.task_helpers import MultiMaskTaskHelper
    TASK_HELPERS['my_task'] = MultiMaskTaskHelper

    where 'my_task' is the name of your task.

  2. In your PVP, make sure that the get_parts() method always inserts the maximum number of mask tokens required for any verbalization. For example, if your verbalizer maps +1 to "really awesome" and -1 to "terrible" and if those are tokenized as ["really", "awe", "##some"] and ["terrible"], respectively, your get_parts() method should always return a sequence that contains exactly 3 mask tokens, as in the sketch below.
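
A possible get_parts() implementation for this example (a sketch; the pattern text itself is illustrative):

def get_parts(self, example: InputExample):
    text_a = self.shortenable(example.text_a)
    # always insert as many mask tokens as the longest verbalization requires (3 here)
    return [text_a, '. It was just ', self.mask, self.mask, self.mask, '!'], []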

With this modification, you can now use verbalizers consisting of multiple tokens:

VERBALIZER = {"+1": ["really good"], "-1": ["just bad"]}

However, there are several limitations to consider:

  • When using a MultiMaskTaskHelper, the maximum batch size for evaluation is 1.
  • As using multiple masks requires multiple forward passes during evaluation, the time required for evaluation scales about linearly with the length of the longest verbalizer. If you require verbalizers that consist of 10 or more tokens, using a generative LM might be a better approach.
  • The MultiMaskTaskHelper class is an experimental feature that is not thoroughly tested. In particular, this feature has only been tested for PET and not for iPET. If you observe something strange, please raise an issue.

For more flexibility, you can also write a custom TaskHelper. As a starting point, you can check out the classes CopaTaskHelper, WscTaskHelper and RecordTaskHelper in pet/task_helpers.py.

📕 Citation

If you make use of the code in this repository, please cite the following papers:

@article{schick2020exploiting,
  title={Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference},
  author={Timo Schick and Hinrich Schütze},
  journal={Computing Research Repository},
  volume={arXiv:2001.07676},
  url={http://arxiv.org/abs/2001.07676},
  year={2020}
}

@article{schick2020small,
  title={It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners},
  author={Timo Schick and Hinrich Schütze},
  journal={Computing Research Repository},
  volume={arXiv:2009.07118},
  url={http://arxiv.org/abs/2009.07118},
  year={2020}
}