
Ditch the Gold Standard: Re-evaluating Conversational Question Answering

This is the repository for our ACL 2022 paper Ditch the Gold Standard: Re-evaluating Conversational Question Answering. The slides for our ACL presentation can be found here.

Overview

In this work, we conduct the first large-scale human evaluation of state-of-the-art conversational QA systems. In our evaluation, human annotators chat with conversational QA models about passages from the QuAC development set and then judge the correctness of the model answers. We release the human-annotated dataset in the section below.

We also identify a critical issue with the current automatic evaluation, which pre-collects human-human conversations and uses ground-truth answers as the conversational history (the differences between the evaluation protocols are illustrated in the figure below). Comparing the two, we find that the automatic evaluation does not always agree with the human evaluation. We propose a new evaluation protocol based on predicted history and question rewriting; our experiments show that the new protocol reflects real-world performance better than the original automatic evaluation. Code for the new evaluation protocol is provided below.

(Figure: Different evaluation protocols.)

Human Evaluation Dataset

You can download the human annotation dataset from data/human_annotation_data.json. The JSON file is structured as follows:

{"data": 
      [{
       # The model evaluated. One of `bert4quac`, `graphflow`, `ham`, `excord`
       "model_name": "graphflow",

       # The passage used in this conversation.
       "context": "Azaria wrote and directed the 2004 short film Nobody's Perfect, ...",

       # The ID from the original QuAC dataset.
       "dialog_id": "C_f0555dd820d84564a189474bbfffd4a1_1_0",

       # The conversation, which contains a list of QA pairs.
       "qas": [{

         # The turn number, starting from 0
         "turn_id": 0,

         # The question from the human annotator
         "question": "What is some voice work he's done?",

         # The answer from the model
         "answer": "Azaria wrote and directed the 2004 short film Nobody's Perfect,",

         # Whether the question is valid (annotated by our human annotator)
         "valid": "y",

         # Whether the question is answerable (annotated by our human annotator)
         "answerable": "y",

         # Whether the model's answer is correct (annotated by our human annotator)
         "correct": "y",
         
         # The answer selected by the human annotator, present ONLY IF they marked the model's answer as incorrect
         "gold_anno": ["Azaria wrote and directed ..."]
         },
         ...
       ]
      },
      ...
]}
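
For example, the short Python snippet below (a minimal sketch that relies only on the fields documented above) loads the file and reports, for each model, how many valid and answerable turns the annotators judged correct:

# Minimal sketch: compute each model's human-judged accuracy from the released annotations.
import json
from collections import defaultdict

with open("data/human_annotation_data.json") as f:
    dialogs = json.load(f)["data"]

correct, total = defaultdict(int), defaultdict(int)
for dialog in dialogs:
    model = dialog["model_name"]
    for turn in dialog["qas"]:
        # Only count turns the annotator marked as both valid and answerable.
        if turn["valid"] == "y" and turn["answerable"] == "y":
            total[model] += 1
            correct[model] += int(turn["correct"] == "y")

for model in sorted(total):
    print(f"{model}: {correct[model]}/{total[model]} judged turns correct")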

Automatic model evaluation interface

We provide a convenient interface for testing model performance under several of the evaluation protocols compared in our paper, including Auto-Pred, Auto-Replace, and our proposed protocol, Auto-Rewrite, which better reflects model performance in human-model conversations. Please refer to our paper for more details. The figure below describes how Auto-Rewrite works.

(Figure: How Auto-Rewrite works.)
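
At a high level, Auto-Rewrite feeds the model its own previous predictions as the conversation history and, when a gold question no longer fits that predicted history, substitutes a context-independent rewrite of the question. The Python sketch below only illustrates this loop; the actual logic lives in run_quac_eval.py, and all function and field names used here (predict, rewrite, needs_rewrite, gold_answers) are hypothetical placeholders.

# Illustrative sketch of the Auto-Rewrite evaluation loop (not the repository's exact code).
from typing import Callable, List, Tuple

def auto_rewrite_eval(
    predict: Callable[[str, List[Tuple[str, str]], str], str],  # (passage, history, question) -> answer
    rewrite: Callable[[str, List[Tuple[str, str]]], str],       # (question, history) -> self-contained question
    needs_rewrite: Callable[[str, str, str], bool],             # (question, gold answer, predicted answer) -> bool
    score: Callable[[str, List[str]], float],                   # (prediction, gold answers) -> e.g. word-level F1
    passage: str,
    turns: List[dict],                                          # gold QuAC turns with "question" and "gold_answers"
) -> float:
    history: List[Tuple[str, str]] = []  # (question, *predicted* answer) pairs
    scores = []
    for i, turn in enumerate(turns):
        question = turn["question"]
        # If the question depends on gold history the model did not actually produce,
        # replace it with a context-independent rewrite so the conversation stays coherent.
        if i > 0 and needs_rewrite(question, turns[i - 1]["gold_answers"][0], history[-1][1]):
            question = rewrite(question, history)
        prediction = predict(passage, history, question)
        scores.append(score(prediction, turn["gold_answers"]))
        history.append((question, prediction))
    return sum(scores) / len(scores)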

Setup

Install dependencies

Please install all dependency packages using the following command:

pip install -r requirements.txt

Download the datasets

Our experiments use the QuAC dataset for passages and conversations, and the test set of the CANARD dataset for the context-independent questions used in Auto-Replace.
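
As a reference, a minimal download sketch is shown below. The QuAC URLs are the ones published on quac.ai; the CANARD test set is distributed from the CANARD project page and should be obtained there, so its URL is not hard-coded here. Filenames and paths are assumptions to adapt to your setup.

# Minimal sketch: fetch the QuAC files used by the training and evaluation scripts.
import urllib.request

QUAC_FILES = {
    "train_v0.2.json": "https://s3.amazonaws.com/my89public/quac/train_v0.2.json",
    "val_v0.2.json": "https://s3.amazonaws.com/my89public/quac/val_v0.2.json",
}

for filename, url in QUAC_FILES.items():
    print(f"Downloading {filename} ...")
    urllib.request.urlretrieve(url, filename)

# The CANARD test set (test.json) must be downloaded separately from the CANARD
# project page and placed where your evaluation command expects it.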

Evaluating existing models

We provide our implementations of the four models used in our paper: BERT, GraphFlow, HAM, and ExCorD. We modified the existing publicly available implementations to use model predictions as the conversation history. Below are instructions for running the evaluation script on each of these models.

BERT

We implemented and trained our own BERT model.

# Run Training
python run_quac_train.py \
  --type bert \
  --model_name_or_path bert-base-uncased \
  --do_train \
  --output_dir ${directory_to_save_model} \
  --overwrite_output_dir \
  --train_file ${path_to_quac_train_file} \
  --train_batch_size 8 \
  --gradient_accumulation_steps 4 \
  --max_seq_length 512 \
  --learning_rate 3e-5 \
  --history_len 2 \
  --warmup_proportion 0.1 \
  --max_grad_norm -1 \
  --weight_decay 0.01 \
  --rationale_beta 0  # setting rationale_beta to 0 is important for BERT

# Run Evaluation (Auto-Rewrite as example)
python run_quac_eval.py \
  --type bert \
  --output_dir ${directory-to-model-checkpoint} \
  --write_dir ${directory-to-write-evaluation-result} \
  --predict_file val_v0.2.json \
  --max_seq_length 512 \
  --doc_stride 128 \
  --max_query_length 64 \
  --match_metric f1 \
  --add_background \
  --skip_entity \
  --rewrite \
  --start_i ${index_of_first_passage_to_eval} \
  --end_i ${index_of_last_passage_to_eval_exclusive}

GraphFlow

We did not find a publicly released model checkpoint, so we trained our own using the authors' training script.

# Download Stanford CoreNLP package
wget https://nlp.stanford.edu/software/stanford-corenlp-latest.zip
unzip stanford-corenlp-latest.zip
rm -f stanford-corenlp-latest.zip

# Start StanfordCoreNLP server
java -mx4g -cp "${directory_to_stanford_corenlp_package}/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 &

# Run Evaluation (Auto-Rewrite as example)
python run_quac_eval.py \
    --type graphflow \
    --predict_file ${path-to-annotated-dev-json-file} \
    --output_dir ${directory-to-model-checkpoint} \
    --saved_vocab_file ${directory-to-saved-model-vocab} \
    --pretrained ${directory-to-model-checkpoint} \
    --write_dir ${directory-to-write-evaluation-result} \
    --match_metric f1 \
    --add_background \
    --skip_entity \
    --rewrite \
    --fix_vocab_embed \
    --f_qem \
    --f_pos \
    --f_ner \
    --use_ques_marker \
    --use_gnn \
    --temporal_gnn \
    --use_bert \
    --use_bert_weight \
    --shuffle \
    --out_predictions \
    --predict_raw_text \
    --out_pred_in_folder \
    --optimizer adamax \
    --start_i ${index_of_first_passage_to_eval} \
    --end_i ${index_of_last_passage_to_eval_exclusive}

HAM

The original model checkpoint can be downloaded from CodaLab.

# Run Evaluation (Auto-Rewrite as example)
python run_quac_eval.py \
  --type ham \
  --output_dir ${directory-to-model-checkpoint} \
  --write_dir ${directory-to-write-evaluation-result} \
  --predict_file val_v0.2.json \
  --max_seq_length 512 \
  --doc_stride 128 \
  --max_query_length 64 \
  --do_lower_case \
  --history_len 6 \
  --match_metric f1 \
  --add_background \
  --skip_entity \
  --rewrite \
  --init_checkpoint ${directory-to-model-checkpoint}/model_52000.ckpt \
  --bert_config_file ${directory-to-pretrained-bert-large-uncased}/bert_config.json \
  --vocab_file ${directory-to-model-checkpoint}/vocab.txt \
  --MTL_mu 0.8 \
  --MTL_lambda 0.1 \
  --mtl_input reduce_mean \
  --max_answer_length 40 \
  --max_considered_history_turns 4 \
  --bert_hidden 1024 \
  --fine_grained_attention \
  --better_hae \
  --MTL \
  --use_history_answer_marker \
  --start_i ${index_of_first_passage_to_eval} \
  --end_i ${index_of_last_passage_to_eval_exclusive}

ExCorD

The original model checkpoint can be downloaded from the authors' repo.

# Run Evaluation (Auto-Rewrite as example)
python run_quac_eval.py \
  --type excord \
  --output_dir ${directory-to-model-checkpoint} \
  --write_dir ${directory-to-write-evaluation-result} \
  --predict_file val_v0.2.json \
  --max_seq_length 512 \
  --doc_stride 128 \
  --max_query_length 64 \
  --match_metric f1 \
  --add_background \
  --skip_entity \
  --rewrite \
  --start_i ${index_of_first_passage_to_eval} \
  --end_i ${index_of_last_passage_to_eval_exclusive}

Evaluating your own model

You can follow our existing implementations of the four models to implement evaluation for your own model. To do so, add a directory under models and write a customized model class following the template in interface.py and our example implementations, as sketched below.
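
As a starting point, a wrapper might look like the sketch below. The exact abstract methods you must implement are the ones declared in interface.py; the class and method names here (MyConvQAModel, predict) are hypothetical placeholders, not the repository's actual interface.

# Hypothetical sketch of a custom model wrapper under models/ (replace the method
# names with the abstract methods actually declared in interface.py).
from typing import List, Tuple

class MyConvQAModel:
    def __init__(self, checkpoint_dir: str):
        # Load your tokenizer and model weights from checkpoint_dir here.
        self.checkpoint_dir = checkpoint_dir

    def predict(self, passage: str, history: List[Tuple[str, str]], question: str) -> str:
        """Return an answer span given the passage, the (question, answer) history
        supplied by the evaluation protocol, and the current question."""
        raise NotImplementedError("plug in your model's inference code here")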

Citation

@inproceedings{li2022ditch,
    title = "Ditch the Gold Standard: Re-evaluating Conversational Question Answering",
    author = "Li, Huihan  and
      Gao, Tianyu  and
      Goenka, Manan  and
      Chen, Danqi",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2022",
    url = "https://aclanthology.org/2022.acl-long.555",
    pages = "8074--8085",
}
