ZeroGen: Efficient Zero-shot Learning via Dataset Generation

Overview

ZEROGEN

This repository contains the code for our paper “ZeroGen: Efficient Zero-shot Learning via Dataset Generation”. Our implementation is built on the source code from dino. Thanks for their work.

If you use this code, please cite our paper:

@article{ye2022zerogen,
      title={ZeroGen: Efficient Zero-shot Learning via Dataset Generation}, 
      author={Jiacheng Ye and Jiahui Gao and Qintong Li and Hang Xu and Jiangtao Feng and Zhiyong Wu and Tao Yu and Lingpeng Kong},
      year={2022},
      eprint={2202.07922},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Setup

All requirements for ZEROGEN can be found in requirements.txt. You can install all required packages in a new environment with pip install -r requirements.txt.

Usage

The scripts/run_cls.sh and scripts/run_qa.sh scripts contain the running commands for the following settings:

  • supervised learning with human annotations (SUPERVISED)
  • prompt-based zero-shot learning (PROMPTING)
  • efficient zero-shot learning via dataset generation (ZEROGEN)

For text classification (TC) tasks (e.g., SST-2 and IMDb) and natural language inference (NLI) tasks (e.g., QNLI and RTE), run bash scripts/run_cls.sh. For question answering (QA) tasks, run bash scripts/run_qa.sh.

When generating X (the text in TC, the hypothesis in NLI, and the question in QA) in the final stage of the scripts, we also train the small model and evaluate it on human annotations. Specifically, after every log_every generated examples, we train on the synthetic dataset and evaluate on the gold validation set. This gives a trend graph similar to Figure 2 in the paper, which is displayed in wandb, a toolkit for tracking experiments.
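The periodic train-and-evaluate schedule described above can be sketched as follows. This is a minimal illustration, not the repository's code: generate_with_periodic_eval, example_stream, and train_and_eval are hypothetical names, and the real scripts report the metric to wandb rather than returning a list.

```python
from typing import Callable, Iterable, List, Tuple


def generate_with_periodic_eval(
    example_stream: Iterable[dict],
    log_every: int,
    train_and_eval: Callable[[List[dict]], float],
) -> List[Tuple[int, float]]:
    """Collect generated examples and evaluate after every `log_every` of them.

    `train_and_eval` is a hypothetical callable that trains the small model on
    the synthetic examples seen so far and returns a metric measured on the
    gold validation set.
    """
    if log_every <= 0:
        raise ValueError("log_every must be a positive integer")

    dataset: List[dict] = []
    history: List[Tuple[int, float]] = []
    for example in example_stream:
        dataset.append(example)
        # Evaluate each time the dataset grows by another `log_every` examples.
        if len(dataset) % log_every == 0:
            history.append((len(dataset), train_and_eval(dataset)))
    return history


# Toy usage: the "metric" is just the dataset size scaled to [0, 1].
toy_stream = ({"X": f"sentence {i}", "Y": i % 2} for i in range(5))
toy_history = generate_with_periodic_eval(
    toy_stream, log_every=2, train_and_eval=lambda data: len(data) / 10
)
```

Each entry in the returned history pairs a dataset size with the metric measured at that size, which is the data needed to draw a trend curve.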

Before running, set the following parameters to match your environment:

  • home_dir: path to ZeroGen
  • gpu: gpu id
  • batch_size: the batch size for generation with the PLM. For SST-2, a batch size of 32 with gpt2-xl requires ~16G of memory. For SQuAD, the same batch size and PLM require ~60G because of the longer contexts. Decrease the batch size if needed.
  • WANDB_PROJECT: project name, by default ZeroGen
  • WANDB_ENTITY: your wandb username
  • WANDB_API_KEY: your api-key

By default, we use GPT2-XL as the pre-trained language model (PLM) and DistilBERT as the tiny task model (TAM). To change the PLM and TAM, modify model_name and small_model_name in the run_xxx.sh scripts.

Run with a synthesized dataset

After dataset generation, we save the synthetic dataset at:

  • For TC and NLI: out-${task_name}-x2/${dataset}/${task_name}-dataset.jsonl (e.g., out-sst-2-x2/gpt2-xl_topk0_topp0.9_sst-2-x2/sst-2-dataset.jsonl). The file is in JSON Lines format, with one example per line (e.g., {"C": "The Book of Mormon Musical", "X": "The Book of Mormon Musical brings all the drama and excitement of a real revival of the Broadway production to the big screen.", "Y": 0}).
  • For QA: out-${task_name}-x2/${dataset}. We save the dataset in huggingface Dataset format.
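As a quick sanity check on a generated TC or NLI file, the JSON Lines records can be read with the standard library. The sketch below assumes only the C, X, and Y fields shown in the example record above. The helper names load_jsonl and label_distribution are illustrative and not part of the repository.

```python
import json
from collections import Counter
from typing import Dict, List


def load_jsonl(path: str) -> List[Dict]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def label_distribution(records: List[Dict]) -> Counter:
    """Count how many generated examples carry each label Y."""
    return Counter(record["Y"] for record in records)
```

For example, label_distribution(load_jsonl("out-sst-2-x2/gpt2-xl_topk0_topp0.9_sst-2-x2/sst-2-dataset.jsonl")) shows whether the generated labels are roughly balanced.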

To run DistilBERT given a generated dataset, you can use the scripts/run_distilbert.sh script.

To run an LSTM-based model given a generated dataset, you can use the scripts/run_cls_lstm.sh script. Before that, you have to download the datasets from the google drive link, which contain the standard test files.

Diversity and Correctness of a synthesized dataset

Diversity

We use Self-BLEU to measure the diversity of a synthesized dataset. To calculate the Self-BLEU for a given dataset, see the example in the scripts/run_self_bleu.sh script.
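To illustrate the idea behind Self-BLEU, the sketch below scores each sentence with BLEU against all other sentences in the dataset and averages the results. A lower value indicates a more diverse dataset. This is a simplified pure-Python version with whitespace tokenization and add-one smoothing, both of which are assumptions made for illustration. It is not the implementation used by scripts/run_self_bleu.sh and will not reproduce its numbers.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: Sequence[str], references: List[Sequence[str]], max_n: int = 4) -> float:
    """Sentence-level BLEU with add-one smoothed n-gram precisions."""
    if not hypothesis or not references:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hypothesis, n)
        # For each n-gram, the largest count observed in any single reference.
        max_ref_counts: Counter = Counter()
        for reference in references:
            for gram, count in ngram_counts(reference, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision_sum += math.log((clipped + 1) / (total + 1))
    # Brevity penalty against the reference length closest to the hypothesis.
    closest_len = min((len(r) for r in references),
                      key=lambda length: (abs(length - len(hypothesis)), length))
    if len(hypothesis) > closest_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - closest_len / len(hypothesis))
    return penalty * math.exp(log_precision_sum / max_n)


def self_bleu(sentences: List[str], max_n: int = 4) -> float:
    """Average BLEU of each sentence against all the others."""
    if len(sentences) < 2:
        raise ValueError("Self-BLEU needs at least two sentences")
    tokenized = [sentence.lower().split() for sentence in sentences]
    scores = [
        bleu(hyp, tokenized[:i] + tokenized[i + 1:], max_n)
        for i, hyp in enumerate(tokenized)
    ]
    return sum(scores) / len(scores)
```

Every sentence is compared against every other one, so the cost grows quadratically with dataset size. For a large synthetic dataset, scoring a random subsample is a common way to keep this tractable.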

Correctness

To calculate the Correctness, you can take the following steps:

  1. Set the following parameters in the scripts/run_distilbert.sh script:

    • small_model_name=roberta-large
    • dataset=: empty means using standard training set
    • limit=: empty means using full standard training set

    This will give you a RoBERTa-Large trained with full human annotations, which can be used as an evaluator.

  2. Set the following parameters in the scripts/run_distilbert.sh script:

    • small_model_ckpt=tmp/checkpoint-xxx: the final RoBERTa-Large checkpoint saved in step 1.
    • limit=10000: the number of samples to use, by default 10000
    • dataset=xxx: the name of synthetic dataset (e.g., gpt2-xl_topk0_topp0.9_sst-2-x2)
    • no_train=true: disable training

    Run the script, and you will get Metric on standard dataset and Metric on synthetic dataset, which represent the Correctness of the standard dataset and the synthetic dataset, respectively.
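Conceptually, the steps above measure how often the evaluator trained on human annotations agrees with the labels attached to the dataset being checked. The sketch below shows that agreement rate for classification labels. It assumes the reported metric is plain accuracy, which the script output does not state explicitly, and the function name correctness is illustrative.

```python
from typing import Sequence


def correctness(evaluator_predictions: Sequence[int], dataset_labels: Sequence[int]) -> float:
    """Fraction of examples where the evaluator's prediction matches the dataset label.

    Assumption: the metric is accuracy over classification labels.
    """
    if len(evaluator_predictions) != len(dataset_labels):
        raise ValueError("predictions and labels must have the same length")
    if not dataset_labels:
        raise ValueError("cannot compute correctness on an empty dataset")
    matches = sum(pred == label for pred, label in zip(evaluator_predictions, dataset_labels))
    return matches / len(dataset_labels)
```

Here evaluator_predictions would come from the RoBERTa-Large checkpoint of step 1, and dataset_labels would be the Y field of the synthetic examples.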

Resources

We provide some synthetic datasets and standard datasets for training the LSTM in this google drive link. When training DistilBERT, the standard dataset is downloaded directly by the huggingface Dataset package. Note that we use the same prompt for IMDb and SST-2, and the same prompt for SQuAD and AdversarialQA, so the synthetic datasets within each pair are also the same.
