Official repository with code and data accompanying the NAACL 2021 paper "Hurdles to Progress in Long-form Question Answering" (https://arxiv.org/abs/2103.06332).

Overview

Hurdles to Progress in Long-form Question Answering

This repository contains the official scripts and datasets accompanying our NAACL 2021 paper, "Hurdles to Progress in Long-form Question Answering". It supports inference from the pretrained retriever / generator and includes evaluation scripts.

Specifically, this codebase contains the model checkpoints, inference scripts for the retriever / generator models, generated outputs from the model using c-REALM retrievals and random retrievals, scripts to compute ROUGE-L / R-Prec scores on the generations, scripts for question paraphrase classification, and scripts for computing ROUGE-L bounds. A link to the original Routing Transformer codebase is provided below.

Setup

pip install transformers
pip install tensor2tensor

For GPU support, you may need to adjust your TensorFlow version to match your CUDA / cuDNN installation (details). GPU support is strongly recommended for faster inference.
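As a quick sanity check before running the slower scripts, you can confirm that TensorFlow sees your GPU. This is a minimal sketch; the exact API available depends on your TensorFlow version.

import tensorflow as tf

# Print the installed TensorFlow version and whether a GPU device is visible.
print(tf.__version__)
print("GPU available:", tf.test.is_gpu_available())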

Model Checkpoints & Generations

Routing Transformer finetuned on ELI5: link
c-REALM TF Hub model + encoded retrieval corpora: link
c-REALM tokenized KILT Wikipedia data: link
c-REALM tokenized ELI5 training data: link
Pre-computed generations & QQP classifier: link

The original Routing Transformer model (pretrained on PG-19) and a local attention version of it can be found in the main repository (link).

Generating from the Routing Transformer

(We have provided the pre-computed c-REALM retrievals for ELI5, so there is no need to run the c-REALM retriever for this step.)

  1. Download the "Routing Transformer finetuned on ELI5" checkpoint listed above and place it inside the models folder.
wget https://storage.googleapis.com/rt-checkpoint/eli5_checkpoint.zip
unzip eli5_checkpoint.zip -d models
rm eli5_checkpoint.zip
  2. Download the generations folder from the Google Drive link listed as "Pre-computed generations & QQP classifier" above.

  3. Run eval_generate_eli5.py to generate from the model (see the example command below). We have provided c-REALM retrieval outputs in the script for the ELI5 validation / test splits. For custom inputs, you will need to load the retriever and the Wikipedia corpus (see the next section). Generation is on the slower side (~4 minutes per ELI5 QA pair on a 1080ti GPU); we hope to switch to the Routing Transformer's faster decoding mode in the near future.
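Assuming the checkpoint was unzipped into models and the generations folder is in place, a minimal invocation looks like the following (check the script's flags for overriding paths or decoding settings):

python eval_generate_eli5.py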

Retrievals from c-REALM

(This script only tests the retriever; it does not depend on the Routing Transformer generator model.)

  1. Download the "c-REALM TF Hub model + encoded retrieval corpora" checkpoint listed above and place it inside the models folder.
wget https://storage.googleapis.com/rt-checkpoint/retriever.zip
unzip retriever.zip -d models
rm retriever.zip
  2. Download "c-REALM tokenized KILT Wikipedia data" if you are interested in retrieving from the KILT Wikipedia corpus and/or "c-REALM tokenized ELI5 training data" if you are interested in retrieving question paraphrases from the ELI5 training set. Place them inside the models folder.
wget https://storage.googleapis.com/rt-checkpoint/eli5_retrieval_train.zip
unzip eli5_retrieval_train.zip -d models
rm eli5_retrieval_train.zip
  3. Run eval_retriever_eli5.py to retrieve using c-REALM. Modify the --retrieval_corpus flag to choose the retrieval corpus (example below).
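For example, to retrieve from the KILT Wikipedia corpus (the flag value shown here is an assumption; check the script's argparse choices for the exact corpus names):

python eval_retriever_eli5.py --retrieval_corpus kilt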

Evaluation of Outputs

Setup

  1. Download the generations folder from the Google Drive link into this root folder.

  2. Clone the KILT repository in this folder and run the installation in a virtual environment.

git clone https://github.com/facebookresearch/KILT
cd KILT
virtualenv -p python3.7 kilt-venv
source kilt-venv/bin/activate
pip install -r requirements.txt
pip install --editable .
  3. If you are interested in using the Quora Question Paraphrase classifier (used in Section 3.2 of the paper), download the roberta-large-finetuned-qqp folder from "Pre-computed generations & QQP classifier" listed above. This model was built by Tu Vu.

  4. Download the ELI5 train, validation and test splits.

cd KILT
wget http://dl.fbaipublicfiles.com/KILT/eli5-train-kilt.jsonl -O train.jsonl
wget http://dl.fbaipublicfiles.com/KILT/eli5-dev-kilt.jsonl -O valid.jsonl
wget http://dl.fbaipublicfiles.com/KILT/eli5-test_without_answers-kilt.jsonl -O test.jsonl

Running evaluations

Enter the KILT folder and run the following command to evaluate the p=0.6 generations with c-REALM retrievals on the validation set:

python kilt/eval_downstream.py ../generations/final_guess_eli5_0.6_predicted_retrieval.jsonl ../generations/final_gold_eli5_0.6_predicted_retrieval.jsonl

which should give you the following output (partly reported in Table 6 of the paper):

{   'downstream': {   'accuracy': 0.0,
                      'em': 0.0,
                      'f1': 0.25566078582652935,
                      'rougel': 0.24417152125142375},
    'kilt': {   'KILT-accuracy': 0.0,
                'KILT-em': 0.0,
                'KILT-f1': 0.03414819887348917,
                'KILT-rougel': 0.03205580975169385},
    'retrieval': {'Rprec': 0.13258897418004187, 'recall@5': 0.2122586648057688}}

To evaluate other configurations, modify the paths in the command above. You can replace 0.6 with 0.9 for higher entropy generations, and replace predicted with random for randomly selected retrieval paragraphs (Hurdle #1 or Section 3.1 in the paper). Note that you should make this change for both the guess and gold files, to ensure correct alignment. We have only provided generations for the validation set since the test set answers / retrievals for ELI5 are hidden behind the KILT leaderboard.
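For example, the p=0.6 generations paired with random retrievals would be evaluated with the command below (the file names follow the same pattern as above; verify them against the contents of the generations folder):

python kilt/eval_downstream.py ../generations/final_guess_eli5_0.6_random_retrieval.jsonl ../generations/final_gold_eli5_0.6_random_retrieval.jsonl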

Question paraphrase classification using QQP Classifier

In Section 3.2 of our paper, we used a Quora Question Paraphrase classifier to find question paraphrases among similar questions retrieved by c-REALM. To run this analysis, make sure you have downloaded the QQP checkpoint (step 3 in Setup) and run:

python run_qqp.py --input_file generations/final_guess_eli5_0.6_similar_questions.jsonl

You should get a score of 43.6%. Note that this is a lower bound: qualitatively, we found the classifier missed several paraphrase pairs with low lexical overlap, as well as cases where the retrieved training set question contains a superset of the information needed to answer the validation set question.
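If you want to score an individual question pair outside of run_qqp.py, the checkpoint appears to be a standard transformers sequence classification model. The sketch below is illustrative only; the label-index-to-class mapping is an assumption, so check the model config before relying on it.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the downloaded QQP checkpoint folder.
tokenizer = AutoTokenizer.from_pretrained("roberta-large-finetuned-qqp")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-finetuned-qqp")
model.eval()

q1 = "Why is the sky blue?"
q2 = "What makes the sky appear blue?"
inputs = tokenizer(q1, q2, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
# Assumption: index 1 corresponds to "is a paraphrase" (the usual QQP convention).
print("P(paraphrase) =", probs[0, 1].item())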

Lower and Upper Bounds on ROUGE-L

Run the following to evaluate bounds on ROUGE-L. Make sure you have completed steps 1 and 4 in the setup above. Scripts to evaluate other bounds involving training set retrieval are coming soon!

cp generate_final_guess_bounds.py KILT/
cd KILT

# Copy-input lower bound; should get 20.0 ROUGE-L
python generate_final_guess_bounds.py --bound_type copy_input

# Random training set answer, should get 15.8-16.2 ROUGE-L depending on randomness
python generate_final_guess_bounds.py --bound_type random_train_ans

# "Performance" can be further boosted by randomly selecting from only longest answers
# for each training set question, up to ~16.7 ROUGE-L. This result is not reported in
# paper, but can be run using:
python generate_final_guess_bounds.py --bound_type random_train_ans_longest

# Longest gold answer upper bound; should get 21.2 ROUGE-L
python generate_final_guess_bounds.py --bound_type longest_gold

# Best gold answer upper bound; should get 26.2 ROUGE-L (takes a while to run, ~45 min on a single core)
python generate_final_guess_bounds.py --bound_type best_gold
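For intuition, here is a standalone sketch of one plausible reading of the best-gold upper bound, assuming the KILT jsonl format (each line has an "output" list of gold answers) and the rouge-score package (pip install rouge-score). The repository's generate_final_guess_bounds.py uses KILT's own ROUGE implementation, so the numbers it reports may differ. Run from the KILT folder so valid.jsonl resolves.

import json
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def best_gold_rouge(gold_answers):
    # Treat each gold answer in turn as the "guess", score it against the remaining
    # gold answers, and keep the best score: an oracle that picks the human answer
    # closest to the other references.
    best = 0.0
    for i, guess in enumerate(gold_answers):
        refs = [a for j, a in enumerate(gold_answers) if j != i]
        if not refs:
            continue
        score = max(scorer.score(ref, guess)["rougeL"].fmeasure for ref in refs)
        best = max(best, score)
    return best

with open("valid.jsonl") as f:
    examples = [json.loads(line) for line in f]

scores = []
for ex in examples:
    answers = [o["answer"] for o in ex.get("output", []) if "answer" in o]
    if len(answers) > 1:
        scores.append(best_gold_rouge(answers))

print("mean best-gold ROUGE-L:", 100 * sum(scores) / len(scores))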

Citation

If you found our paper or this repository useful, please cite:

@inproceedings{lfqa21,
  author = {Kalpesh Krishna and Aurko Roy and Mohit Iyyer},
  booktitle = {North American Association for Computational Linguistics},
  year = {2021},
  title = {Hurdles to Progress in Long-form Question Answering}
}