The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Last update: Dec 12, 2022

Overview

SuperGen

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Requirements

Before running, you need to first install the required packages by typing following commands (Using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

SuperGen is a Supervision Generation method for zero-shot learning on NLU tasks. Instead of training on task-specific data, SuperGen generates training data guided by label-descriptive prompts with a unidirectional language model and fine-tunes another language model on the generated data.

Training and Test Data: Our method does not use any task-specific data (e.g., original training set). We provide our generated training set and original dev set (used as the test set) of each GLUE task under the data directory: train.json files are the generated training set (after data selection); test.tsv files are the original GLUE dev set (used as the test set for evaluation purpose).
Pretraining Corpus: We provide the processed pretraining corpus (Wikipedia and OpenWebText) for generating training data for sequence-pair tasks under the pretrain_corpus directory; see the README file there for details.

Generating Training Data

The generated training set used in the paper are provided as train.json files under each task directory; you should be able to obtain very similar generated data by following the steps below:

Data Generation: The entry script for generating training data for GLUE tasks is gen_train_data.py. The basic usage is

python gen_train_data.py --task $TASK --label $LABEL --save_dir $SAVE_DIR --num_gen $NUM_GEN

You can generate training data of each label either by setting individual label name $LABEL one at a time or by setting $LABEL=all to generate data for all labels (this will still be done sequentially). You may want to set $NUM_GEN to be larger than the desired training set size, as only those texts with the highest generated probability will be used to form the final training set.

Data Selection: After generating the training data, the final training set can be constructed by running the following:

python src/gen_utils.py --task $TASK --num_select_samples $NUM_SELECT \
                        --read_dir $SAVE_DIR --save_dir $DATA_DIR

Example: We provide an example script run_gen.sh that includes the entire generation process for all GLUE tasks under the setting described in the paper.

Fine-Tuning

The entry script for fine-tuning on generated data is finetune.py. The basic usage is

python finetune.py \
    --task_name $TASK \
    --data_dir data/$TASK \
    --overwrite_output_dir \
    --do_train \
    --do_predict \
    --smooth $SM \
    --momentum $MOMENT \
    --eval_steps $INTERVAL \
    --threshold $TH \
    --reg_weight $REG \
    --temp_ensemble_rampup $RAMP \
    --model_name_or_path $MODEL \
    --max_seq_length 128 \
    --first_sent_limit 100 \
    --per_device_train_batch_size $BS \
    --learning_rate $LR \
    --num_train_epochs 3 \
    --output_dir $OUT_DIR \
    --template $TEMPLATE \
    --mapping $MAPPING \
    --warmup_ratio 0.1 \
    --save_at_last \

Example: We provide an example script run_finetune.sh with command line arguments set up for all GLUE tasks under the setting described in the paper.

Results: When using the same prompt-based fine-tuning pipeline (with the same manual prompts and label words), zero-shot SuperGen even achieves better performance than few-shot LM-BFF using 32 annotated samples per class across seven GLUE classification tasks:

Method	MNLI-m/mm	QQP	QNLI	SST-2	CoLA	RTE	MRPC	AVG
LM-BFF 32-Sample Few-Shot	68.3/70.5	65.5	64.5	92.7	9.3	69.1	74.5	63.6
SuperGen Zero-Shot	72.3/73.8	66.1	73.3	92.8	32.7	65.3	82.2	69.4

Acknowledgement

Some scripts in this repository are adapted from COCO-LM (for COCO-LM model), LM-BFF (for prompt-based fine-tuning) and huggingface transformers (for text generation and GLUE processor/trainer).

Citations

Please cite the following paper if you find the code helpful for your research.

@article{meng2022generating,
  title={Generating Training Data with Language Models: Towards Zero-Shot Language Understanding},
  author={Meng, Yu and Huang, Jiaxin and Zhang, Yu and Han, Jiawei},
  journal={arXiv preprint arXiv:2202.04538},
  year={2022}
}

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

Related tags

Overview

SuperGen

Requirements

Overview

Generating Training Data

Fine-Tuning

Acknowledgement

Citations

Owner

Yu Meng

Old Photo Restoration (Official PyTorch Implementation)

Official Code for "Non-deep Networks"

ManiSkill-Learn is a framework for training agents on SAPIEN Open-Source Manipulation Skill Challenge (ManiSkill Challenge), a large-scale learning-from-demonstrations benchmark for object manipulation.

ERISHA is a mulitilingual multispeaker expressive speech synthesis framework. It can transfer the expressivity to the speaker's voice for which no expressive speech corpus is available.

Neural machine translation between the writings of Shakespeare and modern English using TensorFlow

Image Lowpoly based on Centroid Voronoi Diagram via python-opencv and taichi

A PyTorch implementation of "DGC-Net: Dense Geometric Correspondence Network"

Shape Matching of Real 3D Object Data to Synthetic 3D CADs (3DV project @ ETHZ)

Code for our paper at ECCV 2020: Post-Training Piecewise Linear Quantization for Deep Neural Networks

Using contrastive learning and OpenAI's CLIP to find good embeddings for images with lossy transformations

QTool: A Low-bit Quantization Toolbox for Deep Neural Networks in Computer Vision

Answering Open-Domain Questions of Varying Reasoning Steps from Text

Code and models for "Rethinking Deep Image Prior for Denoising" (ICCV 2021)

Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening

This Repo is the official CUDA implementation of ICCV 2019 Oral paper for CARAFE: Content-Aware ReAssembly of FEatures

[ICCV 2021 Oral] Mining Latent Classes for Few-shot Segmentation

FCAF3D: Fully Convolutional Anchor-Free 3D Object Detection

A curated list of neural rendering resources.

Conceptual 12M is a dataset containing (image-URL, caption) pairs collected for vision-and-language pre-training.

Which Style Makes Me Attractive? Interpretable Control Discovery and Counterfactual Explanation on StyleGAN