Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Last update: Dec 03, 2022

Related tags

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

We propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously drives progress in language generation tasks and their evaluation. We accept two types of submissions:

Generator developers submit output text. A Billboard computes all metric scores.
Metric developers submit an executable program. A Billboard computes correlations with the human judgments, updates the ensemble metric, and measures how much it overrates machine over human generations.

Anonymous submissions are allowed!!

Submit

Submission guides and examples are available here.

Scoring Results

Scoring results for all past public submissions are available here. We have generator-name||metric-name.csv files from the Cartesian product between the generators and metrics: each contains instance-level scores.

Citations

Bidimesional Leaderboards

@misc{kasai2021billboard,
    title   = {Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Ronan Le Bras and Lavinia Dunagan and Jacob Morrison and Alexander R. Fabbri and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2112.04139}, 
}

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

@misc{kasai2021thumb,
    title   = {Transparent Human Evaluation for Image Captioning},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Lavinia Dunagan and Jacob Morrison and Ronan Le Bras and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2111.08940}, 
}

CNNDM Summarization Evaluations

@article{fabbri2021summeval,
    title   = {{SummEval}: Re-evaluating Summarization Evaluation},
    author  = {Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
    journal = {TACL},
    year    = {2021},
    url     = {https://arxiv.org/abs/2007.12626},
}

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

@misc{freitag2021experts,
      title={Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation}, 
      author={Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
      year={2021},
      url={https://arxiv.org/abs/2104.14478},
}

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Related tags

Overview

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Introduction

Submit

Scoring Results

Citations

Bidimesional Leaderboards

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

CNNDM Summarization Evaluations

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

Owner

A PyTorch Implementation of the Luna: Linear Unified Nested Attention

A list of Machine Learning Art Colabs

:hot_pepper: R²SQL: "Dynamic Hybrid Relation Network for Cross-Domain Context-Dependent Semantic Parsing." (AAAI 2021)

Official PyTorch code for the paper: "Point-Based Modeling of Human Clothing" (ICCV 2021)

A flexible ML framework built to simplify medical image reconstruction and analysis experimentation.

Implementation of CSRL from the AAAI2022 paper: Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning

Self-Supervised Image Denoising via Iterative Data Refinement

Fuzzing the Kernel Using Unicornafl and AFL++

PyTorch implementation for MINE: Continuous-Depth MPI with Neural Radiance Fields

Creative Applications of Deep Learning w/ Tensorflow

Rainbow DQN implementation that outperforms the paper's results on 40% of games using 20x less data 🌈

Rule-based Customer Segmentation

GraphLily: A Graph Linear Algebra Overlay on HBM-Equipped FPGAs

Repository sharing code and the model for the paper "Rescoring Sequence-to-Sequence Models for Text Line Recognition with CTC-Prefixes"

code for our ECCV 2020 paper "A Balanced and Uncertainty-aware Approach for Partial Domain Adaptation"

The Hailo Model Zoo includes pre-trained models and a full building and evaluation environment

Everything's Talkin': Pareidolia Face Reenactment (CVPR2021)

Learning Logic Rules for Document-Level Relation Extraction

For storing the complete exploration of Visual Question Answering for our B.Tech Project

Large dataset storage format for Pytorch