ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

Overview

ThinkTwice

ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension. Authors are Mengxing Dong, Bowei Zou, Jin Qian, Rongtao Huang and Yu Hong from Soochow University and Institute for Infocomm Research. The paper will be published in NLPCC 2021 soon.

Contents

Background

Our idea is mainly inspired by the way humans think: We first read a lengthy document and remain several slices which are important to our task in our mind; then we are gonna capture the final answer within this limited information.

The goals for this repository are:

  1. A complete code for NewsQA. This repo offers an implement for dealing with long text MRC dataset NewsQA; you can also try this method on other datsets like TriviaQA, Natural Questions yourself.
  2. A comparison description. The performance on ThinkTwice has been listed in the paper.
  3. A public space for advice. You are welcomed to propose an issue in this repo.

Requirements

Clone this repo at your local server. Install necessary libraries listed below.

git clone [email protected]:Walle1493/ThinkTwice.git
pip install requirements.txt

You may install several libraries on yourself.

Dataset

You need to prepare data in a squad2-like format. Since NewsQA (click here seeing more) is similar to SQuAD-2.0, we don't offer the script in this repo. The demo data format is showed below:

"version": "1",
"data": [
    {
        "type": "train",
        "title": "./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story",
        "paragraphs": [
            {
                "context": "NEW DELHI, India (CNN) -- A high court in northern India on Friday acquitted a wealthy...",
                "qas": [
                    {
                        "question": "What was the amount of children murdered?",
                        "id": "./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story01",
                        "answers": [
                            {
                                "answer_start": 294,
                                "text": "19"
                            }
                        ],
                        "is_impossible": false
                    },
                    {
                        "question": "When was Pandher sentenced to death?",
                        "id": "./cnn/stories/42d01e187213e86f5fe617fe32e716ff7fa3afc4.story02",
                        "answers": [
                            {
                                "answer_start": 261,
                                "text": "February"
                            }
                        ],
                        "is_impossible": false
                    }
                ]
            }
        ]
    }
]

P.S.: You are supposed to make a change when dealing with other datasets like TriviaQA or Natural Questions, because we split passages by '\n' character in NewsQA, while not all the same in other datasets.

Train

The training step (including test module) depends mainly on these parameters. We trained our two-stage model on 4 GPUs with 12G 1080Ti in 60 hours.

python code/main.py \
  --do_train \
  --do_eval \
  --eval_test \
  --model bert-base-uncased \
  --train_file ~/Data/newsqa/newsqa-squad2-dataset/squad-newsqa-train.json \
  --dev_file ~/Data/newsqa/newsqa-squad2-dataset/squad-newsqa-dev.json \
  --test_file ~/Data/newsqa/newsqa-squad2-dataset/squad-newsqa-test.json \
  --train_batch_size 256 \
  --train_batch_size_2 24 \
  --eval_batch_size 32  \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --num_train_epochs_2 3 \
  --max_seq_length 128 \
  --max_seq_length_2 512 \
  --doc_stride 128 \
  --eval_metric best_f1 \
  --output_dir outputs/newsqa/retr \
  --output_dir_2 outputs/newsqa/read \
  --data_binary_dir data_binary/retr \
  --data_binary_dir_2 data_binary/read \
  --version_2_with_negative \
  --do_lower_case \
  --top_k 5 \
  --do_preprocess \
  --do_preprocess_2 \
  --first_stage \

In order to improve efficiency, we store data and model generated during training in a binary format. Specifically, when you switch on do_preprocess, the converted data in the first stage will be stored in the directory data_binary, next time you can switch off this option to directly load data. As well, do_preprocess aims at the data in the second stage, and first_stage is for the retriever model. The model and metrics result can be found in the directory output/newsqa after training.

License

Soochow University © Mengxing Dong

Owner
Walle
Walle
Converts python code into c++ by using OpenAI CODEX.

🦾 codex_py2cpp 🤖 OpenAI Codex Python to C++ Code Generator Your Python Code is too slow? 🐌 You want to speed it up but forgot how to code in C++? ⌨

Alexander 423 Jan 01, 2023
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
Nateve compiler developed with python.

Adam Adam is a Nateve Programming Language compiler developed using Python. Nateve Nateve is a new general domain programming language open source ins

Nateve 7 Jan 15, 2022
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

Antlr Project 13.6k Jan 05, 2023
Code for lyric-section-to-comment generation based on huggingface transformers.

CommentGeneration Code for lyric-section-to-comment generation based on huggingface transformers. Migrate Guyu model and code (both 12-layers and 24-l

Yawei Sun 8 Sep 04, 2021
Common Voice Dataset explorer

Common Voice Dataset Explorer Common Voice Dataset is by Mozilla Made during huggingface finetuning week Usage pip install -r requirements.txt streaml

Ceyda Cinarel 22 Nov 16, 2022
Entity Disambiguation as text extraction (ACL 2022)

ExtEnD: Extractive Entity Disambiguation This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Dis

Sapienza NLP group 121 Jan 03, 2023
Non-Autoregressive Translation with Layer-Wise Prediction and Deep Supervision

Deeply Supervised, Layer-wise Prediction-aware (DSLP) Transformer for Non-autoregressive Neural Machine Translation

Chenyang Huang 37 Jan 04, 2023
LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

LightSeq is a high performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT2, Transform

Bytedance Inc. 2.5k Jan 03, 2023
A method to generate speech across multiple speakers

VoiceLoop PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. VoiceLoop is a n

Facebook Archive 873 Dec 15, 2022
Translation to python of Chris Sims' optimization function

pycsminwel This is a locol minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

Gustavo Amarante 1 Mar 21, 2022
This repository describes our reproducible framework for assessing self-supervised representation learning from speech

LeBenchmark: a reproducible framework for assessing SSL from speech Self-Supervised Learning (SSL) using huge unlabeled data has been successfully exp

49 Aug 24, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
运小筹公众号是致力于分享运筹优化(LP、MIP、NLP、随机规划、鲁棒优化)、凸优化、强化学习等研究领域的内容以及涉及到的算法的代码实现。

OlittleRer 运小筹公众号是致力于分享运筹优化(LP、MIP、NLP、随机规划、鲁棒优化)、凸优化、强化学习等研究领域的内容以及涉及到的算法的代码实现。编程语言和工具包括Java、Python、Matlab、CPLEX、Gurobi、SCIP 等。 关注我们: 运筹小公众号 有问题可以直接在

运小筹 151 Dec 30, 2022
Sequence model architectures from scratch in PyTorch

This repository implements a variety of sequence model architectures from scratch in PyTorch. Effort has been put to make the code well structured so that it can serve as learning material. The train

Brando Koch 11 Mar 28, 2022
Machine learning models from Singapore's NLP research community

SG-NLP Machine learning models from Singapore's natural language processing (NLP) research community. sgnlp is a Python package that allows you to eas

AI Singapore | AI Makerspace 21 Dec 17, 2022
Words_And_Phrases - Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours

Words_And_Phrases Just a repo for useful words and phrases that might come handy in some scenarios. Feel free to add yours Abbreviations Abbreviation

Subhadeep Mandal 1 Feb 01, 2022
A curated list of efficient attention modules

awesome-fast-attention A curated list of efficient attention modules

Sepehr Sameni 891 Dec 22, 2022
Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech (text-to-wav), supporting a family of SOTA unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultima

Keon Lee 114 Nov 13, 2022
Uses Google's gTTS module to easily create robo text readin' on command.

Tool to convert text to speech, creating files for later use. TTRS uses Google's gTTS module to easily create robo text readin' on command.

0 Jun 20, 2021