Evaluation suite for large-scale language models.

Overview

LM Evaluation Test Suite

This repo contains code for running the evaluations and reproducing the results from the Jurassic-1 Technical Paper (see blog post), with current support for running the tasks through both the AI21 Studio API and OpenAI's GPT3 API.

Citation

Please use the following bibtex entry:

@techreport{J1WhitePaper,
  author = {Lieber, Opher and Sharir, Or and Lenz, Barak and Shoham, Yoav},
  title = {Jurassic-1: Technical Details And Evaluation},
  institution = {AI21 Labs},
  year = 2021,
  month = aug,
}

Installation

git clone https://github.com/AI21Labs/lm-evaluation.git
cd lm-evaluation
pip install -e .

Usage

The entry point for running the evaluations is lm_evaluation/run_eval.py, which receives a list of tasks and models to run.

The models argument should be in the form "provider/model_name" where provider can be "ai21" or "openai" and the model name is one of the providers supported models.

When running through one of the API models, set the your API key(s) using the environment variables AI21_STUDIO_API_KEY and OPENAI_API_KEY. Make sure to consider the costs and quota limits of the models you are running beforehand.

Examples:

# Evaluate hellaswag and winogrande on j1-large
python -m lm_evaluation.run_eval --tasks hellaswag winogrande --models ai21/j1-large

# Evaluate all multiple-choice tasks on j1-jumbo
python -m lm_evaluation.run_eval --tasks all_mc --models ai21/j1-jumbo

# Evaluate all docprob tasks on curie and j1-large
python -m lm_evaluation.run_eval --tasks all_docprobs --models ai21/j1-large openai/curie

Datasets

The repo currently support the zero-shot multiple-choice and document probability datasets reported in the Jurassic-1 Technical Paper.

Multiple Choice

Multiple choice datasets are formatted as described in the GPT3 paper, and the default reported evaluation metrics are those described there.

All our formatted datasets except for storycloze are publically available and referenced in lm_evaluation/tasks_config.py. Storycloze needs to be manually downloaded and formatted, and the location should be configured through the environment variable 'STORYCLOZE_TEST_PATH'.

Document Probabilities

Document probability tasks include documents from 19 data sources, including C4 and datasets from 'The Pile'.

Each document is pre-split at sentence boundaries to sub-documents of up to 1024 GPT tokens each, to ensure all models see the same inputs/contexts regardless of tokenization, and to support evaluation of models which are limited to sequence lengths of 1024.

Each of the 19 tasks have ~4MB of total text data.

Additional Configuration

Results Folder

By default all results will be saved to the folder 'results', and rerunning the same tasks will load the existing results. The results folder can be changed using the environment variable LM_EVALUATION_RESULTS_DIR.

Activity tragle - Google is tracking everything, we just look at it

activity_tragle Google is tracking everything, we just look at it here. You need

BERNARD Guillaume 1 Feb 15, 2022
Toward Spatially Unbiased Generative Models (ICCV 2021)

Toward Spatially Unbiased Generative Models Implementation of Toward Spatially Unbiased Generative Models (ICCV 2021) Overview Recent image generation

Jooyoung Choi 88 Dec 01, 2022
EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos.

EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos. In this project, we provide the basic code for fitt

ZJU3DV 2.2k Jan 05, 2023
Learning with Noisy Labels via Sparse Regularization, ICCV2021

Learning with Noisy Labels via Sparse Regularization This repository is the official implementation of [Learning with Noisy Labels via Sparse Regulari

Xiong Zhou 38 Oct 20, 2022
'Aligned mixture of latent dynamical systems' (amLDS) for stimulus decoding probabilistic manifold alignment across animals. P. Herrero-Vidal et al. NeurIPS 2021 code.

Across-animal odor decoding by probabilistic manifold alignment (NeurIPS 2021) This repository is the official implementation of aligned mixture of la

Pedro Herrero-Vidal 3 Jul 12, 2022
HiFT: Hierarchical Feature Transformer for Aerial Tracking (ICCV2021)

HiFT: Hierarchical Feature Transformer for Aerial Tracking Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, and Yiming Li Our paper is Accepted by ICCV 2

Intelligent Vision for Robotics in Complex Environment 55 Nov 23, 2022
Keras like implementation of Deep Learning architectures from scratch using numpy.

Mini-Keras Keras like implementation of Deep Learning architectures from scratch using numpy. How to contribute? The project contains implementations

MANU S PILLAI 5 Oct 10, 2021
Using CNN to mimic the driver based on training data from Torcs

Behavioural-Cloning-in-autonomous-driving Using CNN to mimic the driver based on training data from Torcs. Approach First, the data was collected from

Sudharshan 2 Jan 05, 2022
BERTMap: A BERT-Based Ontology Alignment System

BERTMap: A BERT-based Ontology Alignment System Important Notices The relevant paper was accepted in AAAI-2022. Arxiv version is available at: https:/

KRR 36 Dec 24, 2022
PyTorch code for the paper "Complementarity is the King: Multi-modal and Multi-grained Hierarchical Semantic Enhancement Network for Cross-modal Retrieval".

Complementarity is the King: Multi-modal and Multi-grained Hierarchical Semantic Enhancement Network for Cross-modal Retrieval (M2HSE) PyTorch code fo

Xinlei-Pei 6 Dec 23, 2022
A criticism of a recent paper on buggy image downsampling methods in popular image processing and deep learning libraries.

A criticism of a recent paper on buggy image downsampling methods in popular image processing and deep learning libraries.

70 Jul 12, 2022
An AI made using artificial intelligence (AI) and machine learning algorithms (ML) .

DTech.AIML An AI made using artificial intelligence (AI) and machine learning algorithms (ML) . This is created by help of some members in my team and

1 Jan 06, 2022
On the Analysis of French Phonetic Idiosyncrasies for Accent Recognition

On the Analysis of French Phonetic Idiosyncrasies for Accent Recognition With the spirit of reproducible research, this repository contains codes requ

0 Feb 24, 2022
Public Code for NIPS submission SimiGrad: Fine-Grained Adaptive Batching for Large ScaleTraining using Gradient Similarity Measurement

Public code for NIPS submission "SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement" This repo co

Heyang Qin 0 Oct 13, 2021
Adaptive Attention Span for Reinforcement Learning

Adaptive Transformers in RL Official implementation of Adaptive Transformers in RL In this work we replicate several results from Stabilizing Transfor

100 Nov 15, 2022
Code for "Diversity can be Transferred: Output Diversification for White- and Black-box Attacks"

Output Diversified Sampling (ODS) This is the github repository for the NeurIPS 2020 paper "Diversity can be Transferred: Output Diversification for W

50 Dec 11, 2022
A Vision Transformer approach that uses concatenated query and reference images to learn the relationship between query and reference images directly.

A Vision Transformer approach that uses concatenated query and reference images to learn the relationship between query and reference images directly.

24 Dec 13, 2022
Neural Factorization of Shape and Reflectance Under An Unknown Illumination

NeRFactor [Paper] [Video] [Project] This is the authors' code release for: NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown I

Google 283 Jan 04, 2023
YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931)

Introduction Yolov5-face is a real-time,high accuracy face detection. Performance Single Scale Inference on VGA resolution(max side is equal to 640 an

DeepCam Shenzhen 1.4k Jan 07, 2023
a basic code repository for basic task in CV(classification,detection,segmentation)

basic_cv a basic code repository for basic task in CV(classification,detection,segmentation,tracking) classification generate dataset train predict de

1 Oct 15, 2021