Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Last update: Dec 03, 2022

Introduction

We propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously drives progress in language generation tasks and their evaluation. We accept two types of submissions:

Generator developers submit output text. A Billboard computes all metric scores.
Metric developers submit an executable program. A Billboard computes correlations with the human judgments, updates the ensemble metric, and measures how much it overrates machine over human generations.
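As a rough illustration of the correlation step for metric submissions, the sketch below computes a Pearson correlation between a metric's instance-level scores and human judgments. This is only an assumption for illustration: the text above does not say which correlation coefficient a Billboard uses, and the function name `pearson_correlation` and the sample values are hypothetical.

```python
import math


def pearson_correlation(metric_scores, human_scores):
    """Pearson correlation between two equal-length lists of scores.

    Hypothetical helper: the coefficient actually used by a Billboard
    is not specified in this README.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two instances")

    mean_m = sum(metric_scores) / n
    mean_h = sum(human_scores) / n

    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((m - mean_m) * (h - mean_h)
              for m, h in zip(metric_scores, human_scores))
    var_m = sum((m - mean_m) ** 2 for m in metric_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)

    if var_m == 0 or var_h == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_m * var_h)


# Made-up example values, not real leaderboard data.
metric = [0.2, 0.5, 0.9, 0.4]
human = [1.0, 3.0, 5.0, 2.5]
print(pearson_correlation(metric, human))
```

A higher value would indicate that the metric ranks outputs more like the human judges do, under the assumption that both lists are aligned instance by instance.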

Anonymous submissions are allowed.

Submit

Submission guides and examples are available here.

Scoring Results

Scoring results for all past public submissions are available here. There is one generator-name||metric-name.csv file for every pair in the Cartesian product of generators and metrics; each file contains instance-level scores.
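To make the file naming concrete, the following minimal sketch enumerates the expected file names for a set of generators and metrics. The generator and metric names are placeholders, and `score_file_names` is a hypothetical helper, not part of any Billboard tooling; only the `generator-name||metric-name.csv` pattern comes from the description above.

```python
from itertools import product


def score_file_names(generators, metrics):
    """Return one 'generator||metric.csv' name per (generator, metric) pair."""
    return [f"{gen}||{met}.csv" for gen, met in product(generators, metrics)]


# Placeholder names for illustration only.
generators = ["generator-a", "generator-b"]
metrics = ["metric-x", "metric-y", "metric-z"]

for name in score_file_names(generators, metrics):
    print(name)
```

With two generators and three metrics this yields six file names, one per pair. The column layout inside each CSV file is not described here, so the sketch stops at the names.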

Citations

Bidimensional Leaderboards

@misc{kasai2021billboard,
    title   = {Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Ronan Le Bras and Lavinia Dunagan and Jacob Morrison and Alexander R. Fabbri and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2112.04139}, 
}

MSCOCO Captioning Evaluations and THumB 1.0 Protocol

@misc{kasai2021thumb,
    title   = {Transparent Human Evaluation for Image Captioning},
    author  = {Jungo Kasai and Keisuke Sakaguchi and Lavinia Dunagan and Jacob Morrison and Ronan Le Bras and Yejin Choi and Noah A. Smith},
    year    = {2021},
    url     = {https://arxiv.org/abs/2111.08940}, 
}

CNNDM Summarization Evaluations

@article{fabbri2021summeval,
    title   = {{SummEval}: Re-evaluating Summarization Evaluation},
    author  = {Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
    journal = {TACL},
    year    = {2021},
    url     = {https://arxiv.org/abs/2007.12626},
}

WMT20 ZH-EN/EN-DE Machine Translation Evaluations

@misc{freitag2021experts,
    title   = {Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation},
    author  = {Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
    year    = {2021},
    url     = {https://arxiv.org/abs/2104.14478},
}
