CCQA A New Web-Scale Question Answering Dataset for Model Pre-Training

Related tags

Text Data & NLPCCQA
Overview

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

This is the official repository for the code and models of the paper CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training. If you use our dataset, code or any parts thereof, please cite this paper:

@misc{huber-etal-2021-ccqa,
  title={CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training}, 
  author={Patrick Huber and Armen Aghajanyan and Barlas Oğuz and Dmytro Okhonko and Wen-tau Yih and Sonal Gupta and Xilun Chen},
  year={2021},
  eprint={2110.07731},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Getting Common Crawl Snapshots

The Common Crawl project provides monthly web snapshots of new and updates websites in raw HTML format. Every monthly snapshot (~50-70TB) is further separated into smaller WARC (Web ARChive) files. To download a single WARC file, go to the Common Crawl website for the respective month (e.g. May 2021) and download the WARC paths file. The downloaded WARC paths file contains a \newline separated list of download destination of the actual files. Pick a path and prepend s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ for the complete URL. Once downloaded, gunzip the archive and a single Common Crawl web archive is ready to be processed.

Dataset Generation

Dependencies

Below are the required dependencies to run the dataset generation, curation and model evaluations.

  • Rust
  • Rust packages: clap, html-escape, indicatif, kuchiki, rayon, regex, serde, serde_json, warc (see Cargo.toml file for versions)
  • Python 3.7.3
  • Python dependencies: fasttext language identification, fasttext==0.9.2, lxml==4.3.2

Processing Common Crawl data (Rust)

  • Build the cargo package with cargo build from within the rust folder
  • Run the script with cargo run <path/to/warc/file> <path/to/output/file.mhtml>

Curating the minified HTML data (Python)

To generate json objects for every webpage in the minified HTML, run

python mhtml_to_json.py <path/to/fasttext/lid.176.bin> <path/to/mhtml/file> <path/to/output/file>

Aggregating datapoints to remove duplicate URL entries (Python)

As mentioned in the paper, we use the original dataset for our in-domain pre-training experiments. However, we also provide a cleaned version of the dataset, aggregating same-URL duplicates into a single object. To run the datapoint aggregation script, execute

python json_duplicate_filter.py <path/to/json/file> <path/to/output/file>

Converting json dataset into closed-book and passage retrieval formats (Python)

To be able to train closed-book (sequence-to-sequence) and passage retrieval (DPR) models on the CCQA dataset, the corpus needs to be further processed

Closed-book processing

To prepare the dataset for closed-book question-answering training, run:

python closed_book_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>

Passage retrieval (DPR) processing

To prepare the dataset for passage rertieval (DPR) training, run:

python passage_retrieval_processing.py <path/to/json/file> <path/to/output/file> <--only_english> <--keep_markup>

CCQA In-Domain Pre-Trained Model Checkpoints

BART and T5 checkpoints are Huggingface transformer models tested with transformers version 4.8.2

The DPR model checkpoint can be downloaded for the original DPR codebase or the DPR v2 codebase

LICENSE

The majority of CCQA is licensed under CC-BY-NC, however portions of the project are available under separate license terms: crowbook-text-processing is licensed under the MPL-2.0 license.

Owner
Meta Research
Meta Research
End-to-end MLOps pipeline of a BERT model for emotion classification.

image source EmoBERT-MLOps The goal of this repository is to build an end-to-end MLOps pipeline based on the MLOps course from Made with ML, but this

Dimitre Oliveira 4 Nov 06, 2022
STT for TorchScript is a port of Coqui STT based on DeepSpeech to PyTorch.

st3 STT for TorchScript is a port of Coqui STT based on DeepSpeech to PyTorch. Currently it supports converting pbmm models to pt scripts with integra

Vlad Ki 8 Oct 18, 2021
This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest

Rachford-Rice Contest This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest. Can you solve the Rachford-Rice problem for all t

13 Sep 20, 2022
Pytorch NLP library based on FastAI

Quick NLP Quick NLP is a deep learning nlp library inspired by the fast.ai library It follows the same api as fastai and extends it allowing for quick

Agis pof 283 Nov 21, 2022
Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

smaller-LaBSE LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fi

Jeong Ukjae 13 Sep 02, 2022
Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. C

Raphael Sourty 224 Nov 29, 2022
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
A Pytorch implementation of "Splitter: Learning Node Representations that Capture Multiple Social Contexts" (WWW 2019).

Splitter ⠀⠀ A PyTorch implementation of Splitter: Learning Node Representations that Capture Multiple Social Contexts (WWW 2019). Abstract Recent inte

Benedek Rozemberczki 201 Nov 09, 2022
Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

Chinese named entity recognization (bert/roberta/macbert/bert_wwm with Keras)

2 Jul 05, 2022
DAGAN - Dual Attention GANs for Semantic Image Synthesis

Contents Semantic Image Synthesis with DAGAN Installation Dataset Preparation Generating Images Using Pretrained Model Train and Test New Models Evalu

Hao Tang 104 Oct 08, 2022
Using BERT-based models for toxic span detection

SemEval 2021 Task 5: Toxic Spans Detection: Task: Link to SemEval-2021: Task 5 Toxic Span Detection is https://competitions.codalab.org/competitions/2

Ravika Nagpal 1 Jan 04, 2022
Speach Recognitions

easy_meeting Добро пожаловать в интерфейс сервиса автопротоколирования совещаний Easy Meeting. Website - http://cf5c-62-192-251-83.ngrok.io/ Принципиа

Maksim 3 Feb 18, 2022
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language This repository contains UA-GEC data and an accompanying Python lib

Grammarly 227 Jan 02, 2023
This is the writeup of all the challenges from Advent-of-cyber-2019 of TryHackMe

Advent-of-cyber-2019-writeup This is the writeup of all the challenges from Advent-of-cyber-2019 of TryHackMe https://tryhackme.com/shivam007/badges/c

shivam danawale 5 Jul 17, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
多语言降噪预训练模型MBart的中文生成任务

mbart-chinese 基于mbart-large-cc25 的中文生成任务 Input source input: text + /s + lang_code target input: lang_code + text + /s Usage token_ids_mapping.jso

11 Sep 19, 2022
Natural Language Processing

NLP Natural Language Processing apps Multilingual_NLP.py start #This script is demonstartion of Mul

Ritesh Sharma 1 Oct 31, 2021
PyTorch source code of NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

This repository contains source code for NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (P

Alexandra Chronopoulou 89 Aug 12, 2022
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
Blue Brain text mining toolbox for semantic search and structured information extraction

Blue Brain Search Source Code DOI Data & Models DOI Documentation Latest Release Python Versions License Build Status Static Typing Code Style Securit

The Blue Brain Project 29 Dec 01, 2022