True Few-Shot Learning with Language Models

Overview

True Few-Shot Learning with Language Models

This codebase supports using language models (LMs) for true few-shot learning: learning to perform a task using a limited number of examples from a single task distribution. We choose prompts and hyperparameters for few-shot learning methods using no additional held-out data via methods like cross-validation and minimum description length. The code reproduces the results in our paper and supports two forms of few-shot learning:

  1. "In-context" learning using LMs similar to GPT-3. Here, we format a few training examples as input to the LM using a natural language "prompt," and we use the LM to predict the next token. We include the code for in-context learning primarily in the top-level directory (largely in eval_lm.py).
  2. Finetuning via ADAPET, which learns from supervised examples using a modified classification loss alongside an auxiliary masked LM objective. We include the code for finetuning ADAPET in subdirectories (e.g., src/ for training/evaluation code).

You can run this codebase with GPT-2/DistilGPT-2 (using HuggingFace Transformers) or GPT-3 (if you have a key from OpenAI). The underlying model you use is abstracted away using a common API. Below, we describe how to reproduce our results, as well as how to download our precomputed results that we used to produce our paper's plots.

General Setup Instructions

Clone this repo into your current working directory. Then, follow the general setup instructions below:

cd true_few_shot    # Move into cloned directory
export BASE=$PWD    # Store top-level directory location
mkdir data exp_out  # Make directories for data and experiment results, or symlink to a location that can host large files

Continue below to reproduce our prompt selection experiments (with GPT models on LAMA or SuperGLUE). Skip to Hyperparameter Selection to reproduce our ADAPET results.

True Few-Shot Prompt Selection for GPT

Download data for LAMA experiments (LAMA-UHN data, LAMA and LPAQA prompts, and LAMA vocab):

cd data

# Download LAMA-UHN
wget https://www.cis.uni-muenchen.de/~poerner/blobs/e-bert/LAMA_UHN.zip
unzip LAMA_UHN.zip
mv data/* .
rmdir data
rm LAMA_UHN.zip

# Download LAMA vocab
wget https://dl.fbaipublicfiles.com/LAMA/common_vocab_cased.txt

# Download original LAMA (to get manual prompts)
wget https://dl.fbaipublicfiles.com/LAMA/data.zip
unzip data.zip
mv data/* .
rmdir data
rm data.zip

# Download LPAQA prompts from Google Drive (or download manually from https://drive.google.com/file/d/15ypcAYvQGYRtIQ-GH7qSLNDlJZ2Sqs0H/view?usp=sharing)
gdown --id 15ypcAYvQGYRtIQ-GH7qSLNDlJZ2Sqs0H
unzip lpaqa.zip
rm lpaqa.zip

Then, create a virtual Python 3.7+ environment. We installed and activated a Python 3.7 with Anaconda 3 (downloadable from docs.anaconda.com) like so:

conda create -y -n true_few_shot python=3.7
conda activate true_few_shot
# To deactivate the environment, use conda deactivate

Next, install the dependencies for this repo:

cd $BASE
pip install -r requirements_prompt.txt

To experiment with GPT2/DistilGPT2 models, you'll then need to install PyTorch to use transformers models (PyTorch download instructions at pytorch.org). (PyTorch installation is not required for GPT3 models.) We used PyTorch 1.7.1 (with CUDA 11.0.194 for GPU inference), installed with the below command:

torch_version_suffix="+cu110" # Use "+cu100" for CUDA 10.0, "+cu101" for CUDA 10.1, and "" for CUDA 10.2
pip install torch==1.7.1${torch_version_suffix} torchvision==0.8.2${torch_version_suffix} -f https://download.pytorch.org/whl/torch_stable.html

To experiment with GPT3 models, if you have GPT3 access, set your API key as a bash variable (otherwise, you can still run this repo using GPT2 models):

export OPENAI_API_KEY="api-key-here"

At this point, you'll need to get the results from LM(s) on LAMA or SuperGLUE, in order to later evaluate MDL/CV/Test Accuracy. You can get these results by:

  1. Running our true few-shot prompt selection experiments yourself by continuing below, or
  2. Loading our pre-run experiments for GPT2 (instead of having to run them yourself), by skipping to this section

After either of the above, we'll describe how you can plot the results in our paper.

Running true few-shot prompt selection experiments yourself

First, move to the top-level directory in this repo (e.g., run cd $BASE). From there, the following command will run inference with DistilGPT2 on LAMA-UHN:

python eval_lm.py --engine distilgpt2 --num_train 5 --seeds 0

This command chooses 5 random examples from LAMA-UHN as training examples, randomly orders them in 5!=120 different ways, and appends an unseen (test) example from LAMA-UHN to the end. Then, the code evaluates the log-probability of the correct answer for each train/test example, which we'll use later to compute CV/MDL/Test Accuracy. Below we describe each command line flag:

  • --engine: specifies the model you'd like to evaluate with
  • --num_train: specifies the number of training examples you'd like to use (here, 5); using more than 5 will result in evaluating MDL/CV/test accuracy using only a subset of all possible training example permutations, as described in our paper experiments when we use >5 training examples
  • --seeds: takes a list of integers (e.g., 0 1 2) and uses each integer as a random seed for sampling a new LAMA training set (per relation); so if you run with --seeds 0 1 2, you'll run on LAMA 3 times in total (useful for calculating mean and std. error over several runs).

We use a similar command to run on CB, RTE, and WiC, just adding a couple extra flags:

python eval_lm.py --engine distilgpt2 --num_train 5 --seeds 0 --data_name super_glue --rels cb rte wic

The --data_name flag specifies that you want to run on SuperGLUE datasets, and the --rels flags specifies which SuperGLUE datasets you'd like to run on. BoolQ is also supported (using --rels boolq), but be warned that the inputs are quite long, so which can make running GPT2 models time-consuming and running GPT3 models costly. Other datasets require some extra modification to use their respective few-shot approach described in the GPT3 paper.

At this point, if you've run one of the above eval_lm.py commands, you can skip to Post-processing GPT Results. Below, we show the specific commands we used to reproduce different sets of results in our paper.

For reference, we include a full bash loop you can run to reproduce all of our LAMA experiments that use 5 training examples:

for ENGINE in 'distilgpt2' 'gpt2' 'gpt2-medium' 'gpt2-large' 'gpt2-xl' 'ada' 'babbage' 'curie' 'davinci'; do  # Different models, in order: DistilGPT2, GPT2 (117M, 345M, 782M, 1.5B), and GPT3 (2.7B, 6.7B, 13B, 175B) 
for NT in 5; do  # 5 training examples
for SEED in 0 1 2 3 4; do  # 5 random seeds to sample different training sets
python eval_lm.py --engine $ENGINE --num_train $NT --seeds $SEED
# Results will be saved to $BASE/data/rel2template2results.data_name-TREx_UHN.engine-$ENGINE.num_train-$NT.sort_by_weight-False/seed-$SEED
done
done
done

You'll probably want to parallelize the calls to eval_lm.py, since the above will take a while on a single GPU. Running with GPT3 models (ada, babbage, curie, davinci) will query the OpenAI API (so you'll be charged for these queries).

Similarly, we include the full bash loop you can run to reproduce our results for true few-shot prompt selection varying the number of training examples used for selection:

for ENGINE in 'distilgpt2' 'gpt2' 'gpt2-medium' 'gpt2-large' 'gpt2-xl' 'ada' 'babbage'; do  # Here, we don't use curie (13B) and davinci (175B) models for cost reasons 
for NT in 5 10 15 20 30 40; do  # you won't need to run with NT=5 if you ran the above bash loop
for SEED in 0 1 2 3 4; do  # 5 random seeds to sample different training sets
python eval_lm.py --engine $ENGINE --num_train $NT --seeds $SEED
# Results will be saved to $BASE/data/rel2template2results.data_name-TREx_UHN.engine-$ENGINE.num_train-$NT.sort_by_weight-False/seed-$SEED
done
done
done

Lastly, we include the bash loop for reproducing our SuperGLUE experiments:

for ENGINE in 'distilgpt2' 'gpt2' 'gpt2-medium' 'gpt2-large' 'gpt2-xl' 'ada' 'babbage' 'curie' 'davinci'; do 
for NT in 5; do
for SEED in 0 1 2 3 4; do
python eval_lm.py --engine $ENGINE --num_train $NT --seeds $SEED --data_name super_glue --rels cb rte wic 
# Results will be saved to $BASE/data/rel2template2results.data_name-super_glue_UHN.engine-$ENGINE.num_train-$NT.sort_by_weight-False/seed-$SEED
done
done
done

Post-processing GPT Results

We save a lot of statistics about the predictions made by different GPT models using the eval_lm.py command above, which would make it time-consuming to load all of the data in every time we'd like to plot different results. Thus, we first extract the stats that we care about (e.g., stats we need to compute cv/mdl) and save them to smaller files like so:

cd $BASE
DATA_NAME="TREx"     # Use the "TREx" split of LAMA-UHN
FIELDS="nlls ranks"  # names of stats we use to compute cv/mdl/acc ("nlls" for LM Negative Log-Likelihood, "ranks" to get the rank of the true answer, according to the LM -- we will convert this to accuracy later on) 
python extract_fields.py --data_name $DATA_NAME --keys $FIELDS  # see the command line flags in this script, if you'd like to just extract results for a subset of models, training seeds, etc.

For SuperGLUE, we extract the stats for computing cv/mdl/acc as follows:

cd $BASE
DATA_NAME="super_glue"                   # Use super_glue datasets (WiC, RTE, CB)
FIELDS="verbalizer_nlls verbalize_accs"  # names of stats we use to compute cv/mdl/acc ("verbalizer_nlls" to get the NLL of the true answer after eliminating tokens that aren't "verbalizer" tokens or class names; "verbalizer_accs" to get the accuracy when only consider the probabilities of classes with class names instead of all possible tokens) 
python extract_fields.py --data_name $DATA_NAME --keys $FIELDS  # see the command line flags in this script, if you'd like to just extract results for a subset of models, training seeds, etc.

Now you can move to plotting the results from prompt selection using the results from the above commands.

Loading our pre-run experiments for prompt selection

You can load the results from our GPT-2 evaluation runs, to avoid having to evaluate all of the models yourself:

cd $BASE/data
gdown --id 1LSvNS_M47a8QcaZ5-ebNadG_ceW4LzIU  # or download manually from https://drive.google.com/file/d/1LSvNS_M47a8QcaZ5-ebNadG_ceW4LzIU/view
tar -xzvf eval_lm_results.tar.gz
mv eval_lm_results/* .
rmdir eval_lm_results
rm eval_lm_results.tar.gz

We do not have permission from OpenAI to release our GPT-3 results (please send us an email if this is an issue for you). However, our pre-computed GPT-2 results will still allow you to move on to plotting our results for GPT-2 models below.

Plotting results from true few-shot prompt selection

Then, you can compute and save our main paper plots (figures 1-2, as well as 5) for GPT2/DistilGPT2 models like so:

cd $BASE
python plot_results.py --exp 'TREx-vary_models'

The above command will save plots to $BASE/plots/TREx-vary_models. If you have also computed GPT-3 results, simply add the "--use_gpt3" flag to plot GPT-3 results as well.

To plot our other results, change the value for the experiment flag --exp:

--exp Plots results for...
TREx-vary_models various model sizes (all sizes; figures 1, 2, 5)
TREx-vary_num_train various numbers of training examples (all sizes up to 6.7B; figures 3, 4)
TREx-vary_criterion various model selection criteria (all criteria described in the Appendix)
RTE RTE
CB CB
WiC WiC

True Few-Shot Hyperparameter Selection for ADAPET

Here, we describe how to choose hyperparameters for ADAPET, a few-shot learning method that finetunes a language model using a classification loss alongside an auxiliary masked language modeling objective. We use a modified version of their original code, which we include in this repo. We now detail how to setup and run our version of the repo to reproduce our results.

ADAPET trains on GPU, so you'll need to ensure that CUDA is installed for NVIDIA GPUs. First, move to the main directory (cd $BASE) and deactivate any existing virtual environments (e.g. conda deactivate if using conda). Then run source bin/init.sh, which will automatically:

  • Download the FewGLUE and SuperGLUE datasets in data/fewglue/{task} and data/superglue/{task} respectively.
  • Install and setup environment with correct dependencies into a virtual environment.

If you run into issues installing the virtual env with the source bin/init.sh, you can also use conda (as we did) to run experiments instead. We installed Python 3.7.10 using Anaconda 4.8.3, as well as the required dependencies (including PyTorch 1.5) like so:

conda create -y -n adapet python=3.7.10
conda activate adapet
pip install -r requirements.txt

To train with an AMD GPU instead of NVIDIA, you'll need to install ROCm (AMD's CUDA) and then install a ROCm-compatible version of PyTorch 1.8+ instead of PyTorch 1.5 as done above (see pytorch.org for instructions).

We shuffle FewGLUE and generate the 3 additional random subsets of SuperGLUE used in our paper with:

python subsample_superglue.py

For ReCoRD, you'll also need to download separate files containing the few-shot training set labels (we had to format these in a special way for our evaluation scripts later on):

cd $BASE/data/fewglue/ReCoRD
gdown --id 1oTURVQ0Zvoq5cQr7PtqPMc775bp9o6fz
tar -xzvf eval_train_labels.tar.gz
rm eval_train_labels.tar.gz
mv eval_train_labels/* .
rmdir eval_train_labels

You can train an ADAPET model on the full WiC dataset like so:

TN="WiC"     # For other tasks, change to "CB" "COPA" "BoolQ" "RTE" "WSC" "MultiRC" "ReCoRD"
TSS=0        # Random seed for sampling training set. Use TSS=0 for FewGLUE. We used TSS in 0 1 2 3 for our 4 training sets
FMR="False"  # Whether or not to use a fixed masking ratio (as opposed to variable masking ratio -- see ADAPET Appendix Table 12. We always use "False" or variable masking, following ADAPET.)
MA=0.105     # Mask alpha or fraction to use (we sweep over 0.075 0.10 0.105 0.15 following ADAPET)
checkpoint_dirname="tss-$TSS.ma-$MA.fmr-$FMR"
checkpoint_dir="exp_out/fewglue/$TN/albert-xxlarge-v2/$checkpoint_dirname"
mkdir -p $checkpoint_dir  # Make directory for saving results
python src/train.py -c config/$TN.json -k "exp_name='$checkpoint_dirname'" "mask_alpha=$MA" "fixed_mask_ratio=$FMR" "train_set_seed=$TSS" "save_model=True"

Evaluate the K=8 fold cross-validation loss for the above dataset/hyperparameters by training 8 models as follows (the code refers to "folds" as "blocks", following terminology in MDL):

# Set same hyperparameters as before
TN="WiC"
TSS=0
FMR="False"
MA=0.105

SM="cv"      # selection method: "cv" for cross-validation or "mdl" for minimum description length
NB=8         # number of blocks/folds for mdl/cv evaluation 
for BN in $(seq 0 $((NB-1))); do  # iterate over all block (fold) numbers
    checkpoint_dirname="tss-$TSS.ma-$MA.fmr-$FMR.sm-$SM.nb-$NB.bn-$BN"
    checkpoint_dir="exp_out/fewglue/$TN/albert-xxlarge-v2/$checkpoint_dirname"
    mkdir -p $checkpoint_dir
    python src/train.py -c config/$TN.json -k "exp_name='$checkpoint_dirname'" "selection_method='$SM'" "num_blocks=$NB" "block_no=$BN" "mask_alpha=$MA" "fixed_mask_ratio=$FMR" "train_set_seed=$TSS"
done

For more details on how to train/evaluate/test ADAPET, please see the README of the original ADAPET repo, which we used for our code (with only minor modifications).

You can load the results from our training runs, to avoid having to train all of the models yourself:

# Download training run results
cd $BASE/exp_out
rm -r fewglue  # delete any existing results, so we can replace them with our pre-computed results
gdown --id 1Kz5E7v-ejLFeLGSUKPvy9aAPe9_NByZa  # or download manually from https://drive.google.com/file/d/1NAd8nhvpQTl3AYG3jT2ijBtm313JtSDY/view
tar -xzvf fewglue_adapet_results.tar.gz
rm fewglue_adapet_results.tar.gz

You can then print the results from CV/MDL hyperparameter selection with:

cd $BASE
python adapet.py

The above command will print a latex table showing the results for the best/worst/mean/median hyperparameters, as well as the CV/MDL-chosen hyperparameters. You can also show a subset of results (e.g., if you haven't trained on all SuperGLUE tasks), by using command line flags:

cd $BASE
TNS="MultiRC WiC BoolQ"  # Task Names to evaluate on, from {'BoolQ', 'CB', 'COPA', 'RTE', 'WiC', 'WSC', 'MultiRC', 'ReCoRD'} \
TSSS="0 1"  # Train Set Seeds to show mean/std. dev. results for (we used 0 1 2 3 in our paper) \
SMS="cv mdl"  # Selection Methods used to choosen hyperparameters, in {'cv', 'mdl'}
python adapet.py --tns $TNS --tsss $TSSS --sms $SMS

Feel free to open an issue if you have any questions, and have fun true few-shot learning!

Bibtex Citation

@article{perez2021true,
  author = {Ethan Perez and Douwe Kiela and Kyunghyun Cho},
  title = {True Few-Shot Learning with Language Models},
  journal={arXiv},
  year = {2021},
  url = {https://arxiv.org/abs/2105.11447}
}
Owner
Ethan Perez
I'm a Ph.D. student in NLP at NYU. My research is on developing question-answering methods that can generalize to harder questions than we have supervision for.
Ethan Perez
SWA Object Detection

SWA Object Detection This project hosts the scripts for training SWA object detectors, as presented in our paper: @article{zhang2020swa, title={SWA

237 Nov 28, 2022
Implementation of [Time in a Box: Advancing Knowledge Graph Completion with Temporal Scopes].

Time2box Implementation of [Time in a Box: Advancing Knowledge Graph Completion with Temporal Scopes].

LingCai 4 Aug 23, 2022
The Power of Scale for Parameter-Efficient Prompt Tuning

The Power of Scale for Parameter-Efficient Prompt Tuning Implementation of soft embeddings from https://arxiv.org/abs/2104.08691v1 using Pytorch and H

Kip Parker 208 Dec 30, 2022
you can add any codes in any language by creating its respective folder (if already not available).

HACKTOBERFEST-2021-WEB-DEV Beginner-Hacktoberfest Need Your first pr for hacktoberfest 2k21 ? come on in About This is repository of Responsive Portfo

Suman Sharma 8 Oct 17, 2022
Yggdrasil - A simplistic bot designed to streamline your server experience

Ygggdrasil A simplistic bot designed to streamline your server experience. Desig

Sntx_ 1 Dec 14, 2022
CLEAR algorithm for multi-view data association

CLEAR: Consistent Lifting, Embedding, and Alignment Rectification Algorithm The Matlab, Python, and C++ implementation of the CLEAR algorithm, as desc

MIT Aerospace Controls Laboratory 30 Jan 02, 2023
Post-Training Quantization for Vision transformers.

PTQ4ViT Post-Training Quantization Framework for Vision Transformers. We use the twin uniform quantization method to reduce the quantization error on

Zhihang Yuan 61 Dec 28, 2022
통일된 DataScience 폴더 구조 제공 및 가상환경 작업의 부담감 해소

Lucas coded by linux shell 목차 Mac버전 CookieCutter (autoenv) 1.How to Install autoenv 2.폴더 진입 시, activate 구현하기 3.폴더 탈출 시, deactivate 구현하기 4.Alias 설정하기 5

ello 3 Feb 21, 2022
structured-generative-modeling

This repository contains the implementation for the paper Information Theoretic StructuredGenerative Modeling, Specially thanks for the open-source co

0 Oct 11, 2021
Powerful and efficient Computer Vision Annotation Tool (CVAT)

Computer Vision Annotation Tool (CVAT) CVAT is free, online, interactive video and image annotation tool for computer vision. It is being used by our

OpenVINO Toolkit 8.6k Jan 01, 2023
Official repository for the CVPR 2021 paper "Learning Feature Aggregation for Deep 3D Morphable Models"

Deep3DMM Official repository for the CVPR 2021 paper Learning Feature Aggregation for Deep 3D Morphable Models. Requirements This code is tested on Py

38 Dec 27, 2022
Full Resolution Residual Networks for Semantic Image Segmentation

Full-Resolution Residual Networks (FRRN) This repository contains code to train and qualitatively evaluate Full-Resolution Residual Networks (FRRNs) a

Toby Pohlen 274 Oct 27, 2022
StableSims is an open-source project aimed at simulating MakerDAO's Dai stablecoin system

StableSims is an open-source project aimed at simulating MakerDAO's Dai stablecoin system, initially used for researching optimal incentive parameters for Liquidations 2.0.

Blockchain at Berkeley 52 Nov 21, 2022
MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

MAVE: : A Product Dataset for Multi-source Attribute Value Extraction The dataset contains 3 million attribute-value annotations across 1257 unique ca

Google Research Datasets 89 Jan 08, 2023
Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.

This repository is a toolkit to do machine learning for programming languages. It implements tokenization, dataset preprocessing, model training and m

Facebook Research 408 Jan 01, 2023
Normalizing Flows with a resampled base distribution

Resampling Base Distributions of Normalizing Flows Normalizing flows are a popular class of models for approximating probability distributions. Howeve

Vincent Stimper 24 Nov 03, 2022
Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh

Arjun Majumdar 44 Dec 14, 2022
Company clustering with K-means/GMM and visualization with PCA, t-SNE, using SSAN relation extraction

RE results graph visualization and company clustering Installation pip install -r requirements.txt python -m nltk.downloader stopwords python3.7 main.

Jieun Han 1 Oct 06, 2022
Implementation of fast algorithms for Maximum Spanning Tree (MST) parsing that includes fast ArcMax+Reweighting+Tarjan algorithm for single-root dependency parsing.

Fast MST Algorithm Implementation of fast algorithms for (Maximum Spanning Tree) MST parsing that includes fast ArcMax+Reweighting+Tarjan algorithm fo

Miloš Stanojević 11 Oct 14, 2022
Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks

Uniformer - Pytorch Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification ta

Phil Wang 90 Nov 24, 2022