Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

Overview

Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

This repository maintains some utility scripts for retrieving and preprocessing Wikipedia text for natural language processing (NLP) research.

Note: The scripts have been created for and tested with the Japanese version of Wikipedia only.

Preprocessed files

Some of the preprocessed files generated by this repo's scripts can be downloaded from the Releases page.

All the preprocessed files are distributed under the CC-BY-SA 3.0 and GFDL licenses. For more information, see the License section below.

Example usage of the scripts

Get Wikipedia page ids from a Cirrussearch dump file

get_all_page_ids_from_cirrussearch.py

This script extracts the page ids and revision ids of all pages from a Wikipedia Cirrussearch dump file (available from this site.) It also adds the following information to each item based on the information in the dump file:

  • "num_inlinks": the number of incoming links to the page.
  • "is_disambiguation_page": whether the page is a disambiguation page.
  • "is_sexual_page": whether the page is labeled containing sexual contents.
  • "is_violent_page": whether the page is labeled containing violent contents.
$ python get_all_page_ids_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json

# If you want the output file sorted by the page id:
$ cat ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json|jq -s -c 'sort_by(.pageid)[]' > ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129-sorted.json
$ mv ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129-sorted.json ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json

The script outputs a JSON Lines file containing following items, one item per line:

{
    "title": "アンパサンド",
    "pageid": 5,
    "revid": 85364431,
    "num_inlinks": 231,
    "is_disambiguation_page": false,
    "is_sexual_page": false,
    "is_violent_page": false
}

Get Wikipedia page HTMLs

get_page_htmls.py

This script fetches HTML contents of the Wikipedia pages specified by the page ids in the input file. It makes use of Wikimedia REST API to accsess the contents of Wikipedia pages.

Important: Be sure to check the terms and conditions of the API documented in the official page. Especially, you may not send more than 200 requests/sec to the API. You should also set your contact information (e.g., email address) in the User-Agent header so that Wikimedia can contact you quickly if necessary.

# It takes about 2 days to fetch all the articles in Japanese Wikipedia
$ python get_page_htmls.py \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--output_file ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz \
--language ja \
--user_agent <your_contact_information> \
--batch_size 20 \
--mobile

# If you want the output file sorted by the page id:
$ zcat ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz|jq -s -c 'sort_by(.pageid)[]'|gzip > ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129-sorted.json.gz
$ mv ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129-sorted.json.gz ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz

# Splitting the file for distribution
$ gunzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz
$ split -n l/5 --numeric-suffixes=1 --additional-suffix=.json ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.
$ gzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.*.json
$ gzip ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json

The script outputs a gzipped JSON Lines file containing following items, one item per line:

{
  "title": "アンパサンド",
  "pageid": 5,
  "revid": 85364431,
  "url": "https://ja.wikipedia.org/api/rest_v1/page/mobile-html/%E3%82%A2%E3%83%B3%E3%83%91%E3%82%B5%E3%83%B3%E3%83%89/85364431",
  "html": "
}

Extract paragraphs from the Wikipedia page HTMLs

extract_paragraphs_from_page_htmls.py

This script extracts paragraph texts from a Wikipedia page HTMLs file generated by get_page_htmls.py. You can specify the minimum and maximum length of the paragraph texts to be extracted.

# This produces 8,921,367 paragraphs
$ python extract_paragraphs_from_page_htmls.py \
--page_htmls_file ~/work/wikipedia-utils/20211129/page-htmls-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--min_paragraph_length 10 \
--max_paragraph_length 1000

Make a plain text corpus of Wikipedia paragraph/page texts

make_corpus_from_paragraphs.py

This script produces a plain text corpus file from a paragraphs file generated by extract_paragraphs_from_page_htmls.py. You can optionally filter out disambiguation/sexual/violent pages from the output file by specifying the corresponding command line options.

Here we use mecab-ipadic-NEologd in splitting texts into sentences so that some sort of named entities will not be split into sentences.

The output file is a gzipped text file containing one sentence per line, with the pages separated by blank lines.

# 22,651,544 lines from all pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129.txt.gz \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000

# 18,721,087 lines from filtered pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129-filtered-large.txt.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000 \
--min_inlinks 10 \
--exclude_sexual_pages

make_corpus_from_cirrussearch.py

This script produces a plain text corpus file by simply taking the text attributes of pages from a Wikipedia Cirrussearch dump file.

The resulting corpus file will be somewhat different from the one generated by make_corpus_from_paragraphs.py due to some differences in text processing. In addition, since the text attributes in the Cirrussearch dump file does not retain the page structure, it is less flexible to modify the processing of text compared to processing an HTML file with make_corpus_from_paragraphs.py.

$ python make_corpus_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20211129/corpus-jawiki-20211129-cirrus.txt.gz \
--min_inlinks 10 \
--exclude_sexual_pages \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7'

Make a passages file from extracted paragraphs

make_passages_from_paragraphs.py

This script takes a paragraphs file generated by extract_paragraphs_from_page_htmls.py and splits the paragraph texts into a collection of pieces of texts called passages (sections/paragraphs/sentences).

It is useful for creating texts of a reasonable length that can be handled by passage-retrieval systems such as DPR.

# Make single passage from one paragraph
# 8,672,661 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/passages-para-jawiki-20211129.json.gz \
--passage_unit paragraph \
--passage_boundary section \
--max_passage_length 400

# Make single passage from consecutive sentences within a section
# 5,170,346 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20211129/paragraphs-jawiki-20211129.json.gz \
--output_file ~/work/wikipedia-utils/20211129/passages-c400-jawiki-20211129.json.gz \
--passage_unit sentence \
--passage_boundary section \
--max_passage_length 400 \
--as_long_as_possible

Build Elasticsearch indices of Wikipedia passages/pages

Requirements

  • Elasticsearch 6.x with several plugins installed
# For running build_es_index_passages.py
$ ./bin/elasticsearch-plugin install analysis-kuromoji

# For running build_es_index_cirrussearch.py (Elasticsearch 6.5.4 is needed)
$ ./bin/elasticsearch-plugin install analysis-icu
$ ./bin/elasticsearch-plugin install org.wikimedia.search:extra:6.5.4

build_es_index_passages.py

This script builds an Elasticsearch index of passages generated by make_passages_from_paragraphs.

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20211129/passages-para-jawiki-20211129.json.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--index_name jawiki-20211129-para

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20211129/passages-c400-jawiki-20211129.json.gz \
--page_ids_file ~/work/wikipedia-utils/20211129/page-ids-jawiki-20211129.json \
--index_name jawiki-20211129-c400

build_es_index_cirrussearch.py

This script builds an Elasticsearch index of Wikipedia pages using a Cirrussearch dump file. Cirrussearch dump files are originally for Elasticsearch bulk indexing, so this script simply takes the page information from the dump file to build an index.

$ python build_es_index_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20211129/jawiki-20211129-cirrussearch-content.json.gz \
--index_name jawiki-20211129-cirrus \
--language ja

License

The content of Wikipedia, which can be obtained with the codes in this repository, is licensed under the CC-BY-SA 3.0 and GFDL licenses.

The codes in this repository are licensed under the Apache License 2.0.

You might also like...
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

Lingtrain Aligner — ML powered library for the accurate texts alignment.
Lingtrain Aligner — ML powered library for the accurate texts alignment.

Lingtrain Aligner ML powered library for the accurate texts alignment in different languages. Purpose Main purpose of this alignment tool is to build

Augmenty is an augmentation library based on spaCy for augmenting texts.
Augmenty is an augmentation library based on spaCy for augmenting texts.

Augmenty: The cherry on top of your NLP pipeline Augmenty is an augmentation library based on spaCy for augmenting texts. Besides a wide array of high

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

This library is testing the ethics of language models by using natural adversarial texts.
This library is testing the ethics of language models by using natural adversarial texts.

prompt2slip This library is testing the ethics of language models by using natural adversarial texts. This tool allows for short and simple code and v

Biterm Topic Model (BTM): modeling topics in short texts
Biterm Topic Model (BTM): modeling topics in short texts

Biterm Topic Model Bitermplus implements Biterm topic model for short texts introduced by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. Actua

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Text Classification in Turkish Texts with Bert
Text Classification in Turkish Texts with Bert

You can watch the details of the project on my youtube channel Project Interface Project Second Interface Goal= Correctly guessing the classification

Releases(2022-04-04)
Owner
Masatoshi Suzuki
Masatoshi Suzuki
Generate a cool README/About me page for your Github Profile

Github Profile README/ About Me Generator 💯 This webapp lets you build a cool README for your profile. A few inputs + ~15 mins = Your Github Profile

Rahul Banerjee 179 Jan 07, 2023
NLP topic mdel LDA - Gathered from New York Times website

NLP topic mdel LDA - Gathered from New York Times website

1 Oct 14, 2021
Python library for Serbian Natural language processing (NLP)

SrbAI - Python biblioteka za procesiranje srpskog jezika SrbAI je projekat prikupljanja algoritama i modela za procesiranje srpskog jezika u jedinstve

Serbian AI Society 3 Nov 22, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

HUSEIN ZOLKEPLI 151 Jan 05, 2023
[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

New Benchmarks for Learning on Non-Homophilous Graphs Here are the codes and datasets accompanying the paper: New Benchmarks for Learning on Non-Homop

94 Dec 21, 2022
A minimal code for fairseq vq-wav2vec model inference.

vq-wav2vec inference A minimal code for fairseq vq-wav2vec model inference. Runs without installing the fairseq toolkit and its dependencies. Usage ex

Vladimir Larin 7 Nov 15, 2022
Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. C

Raphael Sourty 224 Nov 29, 2022
Sequence modeling benchmarks and temporal convolutional networks

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN) This repository contains the experiments done in the work An Empirical Evaluati

CMU Locus Lab 3.5k Jan 03, 2023
NLP - Machine learning

Flipkart-product-reviews NLP - Machine learning About Product reviews is an essential part of an online store like Flipkart’s branding and marketing.

Harshith VH 1 Oct 29, 2021
Search msDS-AllowedToActOnBehalfOfOtherIdentity

前言 现在进行RBCD的攻击手段主要是搜索mS-DS-CreatorSID,如果机器的创建者是我们可控的话,那就可以修改对应机器的msDS-AllowedToActOnBehalfOfOtherIdentity,利用工具SharpAllowedToAct-Modify 那我们索性也试试搜索所有计算机

Jumbo 26 Dec 05, 2022
Source code of the "Graph-Bert: Only Attention is Needed for Learning Graph Representations" paper

Graph-Bert Source code of "Graph-Bert: Only Attention is Needed for Learning Graph Representations". Please check the script.py as the entry point. We

14 Mar 25, 2022
Sequence-to-Sequence Framework in PyTorch

nmtpytorch allows training of various end-to-end neural architectures including but not limited to neural machine translation, image captioning and au

LIUM 395 Nov 21, 2022
Code for hyperboloid embeddings for knowledge graph entities

Implementation for the papers: Self-Supervised Hyperboloid Representations from Logical Queries over Knowledge Graphs, Nurendra Choudhary, Nikhil Rao,

30 Dec 10, 2022
SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning

SASE : Self-Adaptive noise distribution network for Speech Enhancement with heterogeneous data of Cross-Silo Federated learning We propose a SASE mode

Tower 1 Nov 20, 2021
Automatic privilege escalation for misconfigured capabilities, sudo and suid binaries

GTFONow Automatic privilege escalation for misconfigured capabilities, sudo and suid binaries. Features Automatically escalate privileges using miscon

101 Jan 03, 2023
✨Fast Coreference Resolution in spaCy with Neural Networks

✨ NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks. NeuralCoref is a pipeline extension for spaCy 2.1+ which annotates and resolv

Hugging Face 2.6k Jan 04, 2023
Code repository for "It's About Time: Analog clock Reading in the Wild"

it's about time Code repository for "It's About Time: Analog clock Reading in the Wild" Packages required: pytorch (used 1.9, any reasonable version s

52 Nov 10, 2022
Perform sentiment analysis on textual data that people generally post on websites like social networks and movie review sites.

Sentiment Analyzer The goal of this project is to perform sentiment analysis on textual data that people generally post on websites like social networ

Madhusudan.C.S 53 Mar 01, 2022
IEEEXtreme15.0 Questions And Answers

IEEEXtreme15.0 Questions And Answers IEEEXtreme is a global challenge in which teams of IEEE Student members – advised and proctored by an IEEE member

Dilan Perera 15 Oct 24, 2022