Evaluation suite for large-scale language models.

Overview

LM Evaluation Test Suite

This repo contains code for running the evaluations and reproducing the results from the Jurassic-1 Technical Paper (see blog post), with current support for running the tasks through both the AI21 Studio API and OpenAI's GPT3 API.

Citation

Please use the following bibtex entry:

@techreport{J1WhitePaper,
  author = {Lieber, Opher and Sharir, Or and Lenz, Barak and Shoham, Yoav},
  title = {Jurassic-1: Technical Details And Evaluation},
  institution = {AI21 Labs},
  year = 2021,
  month = aug,
}

Installation

git clone https://github.com/AI21Labs/lm-evaluation.git
cd lm-evaluation
pip install -e .

Usage

The entry point for running the evaluations is lm_evaluation/run_eval.py, which receives a list of tasks and models to run.

The models argument should be in the form "provider/model_name" where provider can be "ai21" or "openai" and the model name is one of the providers supported models.

When running through one of the API models, set the your API key(s) using the environment variables AI21_STUDIO_API_KEY and OPENAI_API_KEY. Make sure to consider the costs and quota limits of the models you are running beforehand.

Examples:

# Evaluate hellaswag and winogrande on j1-large
python -m lm_evaluation.run_eval --tasks hellaswag winogrande --models ai21/j1-large

# Evaluate all multiple-choice tasks on j1-jumbo
python -m lm_evaluation.run_eval --tasks all_mc --models ai21/j1-jumbo

# Evaluate all docprob tasks on curie and j1-large
python -m lm_evaluation.run_eval --tasks all_docprobs --models ai21/j1-large openai/curie

Datasets

The repo currently support the zero-shot multiple-choice and document probability datasets reported in the Jurassic-1 Technical Paper.

Multiple Choice

Multiple choice datasets are formatted as described in the GPT3 paper, and the default reported evaluation metrics are those described there.

All our formatted datasets except for storycloze are publically available and referenced in lm_evaluation/tasks_config.py. Storycloze needs to be manually downloaded and formatted, and the location should be configured through the environment variable 'STORYCLOZE_TEST_PATH'.

Document Probabilities

Document probability tasks include documents from 19 data sources, including C4 and datasets from 'The Pile'.

Each document is pre-split at sentence boundaries to sub-documents of up to 1024 GPT tokens each, to ensure all models see the same inputs/contexts regardless of tokenization, and to support evaluation of models which are limited to sequence lengths of 1024.

Each of the 19 tasks have ~4MB of total text data.

Additional Configuration

Results Folder

By default all results will be saved to the folder 'results', and rerunning the same tasks will load the existing results. The results folder can be changed using the environment variable LM_EVALUATION_RESULTS_DIR.

Complex Answer Generation For Conversational Search Systems.

Complex Answer Generation For Conversational Search Systems. Code for Does Structure Matter? Leveraging Data-to-Text Generation for Answering Complex

Hanane Djeddal 0 Dec 06, 2021
Implementation for our ICCV 2021 paper: Dual-Camera Super-Resolution with Aligned Attention Modules

DCSR: Dual Camera Super-Resolution Implementation for our ICCV 2021 oral paper: Dual-Camera Super-Resolution with Aligned Attention Modules paper | pr

Tengfei Wang 110 Dec 20, 2022
A PyTorch re-implementation of Neural Radiance Fields

nerf-pytorch A PyTorch re-implementation Project | Video | Paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Ben Mildenhall

Krishna Murthy 709 Jan 09, 2023
Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

NÜWA - Pytorch (wip) Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch. This repository will be popul

Phil Wang 463 Dec 28, 2022
code for Image Manipulation Detection by Multi-View Multi-Scale Supervision

MVSS-Net Code and models for ICCV 2021 paper: Image Manipulation Detection by Multi-View Multi-Scale Supervision Update 22.02.17, Pretrained model for

dong_chengbo 131 Dec 30, 2022
Calculates carbon footprint based on fuel mix and discharge profile at the utility selected. Can create graphs and tabular output for fuel mix based on input file of series of power drawn over a period of time.

carbon-footprint-calculator Conda distribution ~/anaconda3/bin/conda install anaconda-client conda-build ~/anaconda3/bin/conda config --set anaconda_u

Seattle university Renewable energy research 7 Sep 26, 2022
PyTorch implementation of CVPR 2020 paper (Reference-Based Sketch Image Colorization using Augmented-Self Reference and Dense Semantic Correspondence) and pre-trained model on ImageNet dataset

Reference-Based-Sketch-Image-Colorization-ImageNet This is a PyTorch implementation of CVPR 2020 paper (Reference-Based Sketch Image Colorization usin

Yuzhi ZHAO 11 Jul 28, 2022
CarND-LaneLines-P1 - Lane Finding Project for Self-Driving Car ND

Finding Lane Lines on the Road Overview When we drive, we use our eyes to decide where to go. The lines on the road that show us where the lanes are a

Udacity 769 Dec 27, 2022
Pytorch code for "Text-Independent Speaker Verification Using 3D Convolutional Neural Networks".

:speaker: Deep Learning & 3D Convolutional Neural Networks for Speaker Verification

Amirsina Torfi 114 Dec 18, 2022
Code for Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works

GDAP Code for Generating Disentangled Arguments with Prompts: A Simple Event Extraction Framework that Works Environment Python (verified: v3.8) CUDA

45 Oct 29, 2022
Research on controller area network Intrusion Detection Systems

Group members information Member 1: Lixue Liang Member 2: Yuet Lee Chan Member 3: Xinruo Zhang Member 4: Yifei Han User Manual Generate Attack Packets

Roche 4 Aug 30, 2022
This code is an implementation for Singing TTS.

MLP Singer This code is an implementation for Singing TTS. The algorithm is based on the following papers: Tae, J., Kim, H., & Lee, Y. (2021). MLP Sin

Heejo You 22 Dec 23, 2022
[SIGGRAPH Asia 2019] Artistic Glyph Image Synthesis via One-Stage Few-Shot Learning

AGIS-Net Introduction This is the official PyTorch implementation of the Artistic Glyph Image Synthesis via One-Stage Few-Shot Learning. paper | suppl

Yue Gao 102 Jan 02, 2023
Bayesian Optimization Library for Medical Image Segmentation.

bayesmedaug: Bayesian Optimization Library for Medical Image Segmentation. bayesmedaug optimizes your data augmentation hyperparameters for medical im

Şafak Bilici 7 Feb 10, 2022
Compute FID scores with PyTorch.

FID score for PyTorch This is a port of the official implementation of Fréchet Inception Distance to PyTorch. See https://github.com/bioinf-jku/TTUR f

2.1k Jan 06, 2023
Supervised Contrastive Learning for Downstream Optimized Sequence Representations

SupCL-Seq 📖 Supervised Contrastive Learning for Downstream Optimized Sequence representations (SupCS-Seq) accepted to be published in EMNLP 2021, ext

Hooman Sedghamiz 18 Oct 21, 2022
PyTorch implementation of ARM-Net: Adaptive Relation Modeling Network for Structured Data.

A ready-to-use framework of latest models for structured (tabular) data learning with PyTorch. Applications include recommendation, CRT prediction, healthcare analytics, and etc.

48 Nov 30, 2022
Code and training data for our ECCV 2016 paper on Unsupervised Learning

Shuffle and Learn (Shuffle Tuple) Created by Ishan Misra Based on the ECCV 2016 Paper - "Shuffle and Learn: Unsupervised Learning using Temporal Order

Ishan Misra 44 Dec 08, 2021
Spherical CNNs

Spherical CNNs Equivariant CNNs for the sphere and SO(3) implemented in PyTorch Overview This library contains a PyTorch implementation of the rotatio

Jonas Köhler 893 Dec 28, 2022
This library contains a Tensorflow implementation of the paper Stability Analysis of Unfolded WMMSE for Power Allocation

UWMMSE-stability Tensorflow implementation of Stability Analysis of UWMMSE Overview This library contains a Tensorflow implementation of the paper Sta

Arindam Chowdhury 1 Nov 16, 2022