Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

This repository maintains utility scripts for retrieving and preprocessing Wikipedia text for natural language processing (NLP) research. Some of the preprocessed files are also available on the Hugging Face Hub.

Note: The scripts have been created for and tested with the Japanese version of Wikipedia only.

Preprocessed files

Some of the preprocessed files generated by this repository's scripts can be downloaded from the Releases page. Some of them are also available on the Hugging Face Hub.

All the preprocessed files are distributed under the CC-BY-SA 3.0 and GFDL licenses. For more information, see the License section below.

Example usage of the scripts

Get Wikipedia page ids from a Cirrussearch dump file

This script extracts the page ids and revision ids of all pages from a Wikipedia Cirrussearch dump file (available from this site). It also adds the following information to each item, based on the information in the dump file:

  • "num_inlinks": the number of incoming links to the page.
  • "is_disambiguation_page": whether the page is a disambiguation page.
  • "is_sexual_page": whether the page is labeled containing sexual contents.
  • "is_violent_page": whether the page is labeled containing violent contents.
$ python get_all_page_ids_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20240401/jawiki-20240401-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json

# If you want the output file sorted by the page id:
$ cat ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json|jq -s -c 'sort_by(.pageid)[]' > ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401-sorted.json
$ mv ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401-sorted.json ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json

The script outputs a JSON Lines file containing the following items, one item per line:

{
  "title": "アンパサンド",
  "pageid": 5,
  "revid": 99347164,
  "num_inlinks": 279,
  "is_disambiguation_page": false,
  "is_sexual_page": false,
  "is_violent_page": false
}
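
For reference, this output can be consumed directly with Python's standard library. The snippet below is a minimal sketch (not part of this repository) that keeps only pages with at least 10 incoming links and drops disambiguation pages, using the fields shown above; the file path is a placeholder.

import json

# Minimal sketch (not part of this repository): filter the page ids file
# using the fields shown in the sample record above.
with open("page-ids-jawiki-20240401.json", encoding="utf-8") as f:
    filtered = [
        item for item in map(json.loads, f)
        if item["num_inlinks"] >= 10 and not item["is_disambiguation_page"]
    ]
print(len(filtered), "pages kept")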

Get Wikipedia page HTMLs

This script fetches the HTML contents of the Wikipedia pages specified by the page ids in the input file. It makes use of the Wikimedia REST API to access the contents of Wikipedia pages.

Important: Be sure to check the terms and conditions of the API documented on the official page. In particular, as of this writing, you may not send more than 200 requests/sec to the API. You should also set your contact information (e.g., an email address) in the User-Agent header so that Wikimedia can contact you quickly if necessary.
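
For reference, fetching a single page looks roughly like the following. This is a simplified sketch rather than the script itself; it assumes the requests library, uses the same REST endpoint format as in the sample output below, and the contact address is a placeholder.

import requests

# Simplified sketch, not the actual script: fetch one page's HTML from the
# Wikimedia REST API, identifying ourselves via the User-Agent header.
headers = {"User-Agent": "wikipedia-utils example (contact: you@example.com)"}  # placeholder
url = "https://ja.wikipedia.org/api/rest_v1/page/html/%E3%82%A2%E3%83%B3%E3%83%91%E3%82%B5%E3%83%B3%E3%83%89"
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
html = response.text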

$ split -n l/10 --numeric-suffixes=1 --additional-suffix=.json ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.

# It takes about 1 or 2 days to fetch all the articles in Japanese Wikipedia
$ for i in `seq -f %02g 1 10`; do \
python get_page_htmls.py \
--page_ids_file ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.$i.json \
--output_file ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.$i.json.gz \
--language ja \
--user_agent <your_contact_information> \
--batch_size 50 ; \
done

# If you want the output file sorted by the page id:
$ for i in `seq -f %02g 1 10`; do \
zcat ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.$i.json.gz|jq -s -c 'sort_by(.pageid)[]'|gzip > ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401-sorted.$i.json.gz && \
mv ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401-sorted.$i.json.gz ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.$i.json.gz ; \
done

$ zcat ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.*.json.gz|gzip > ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.json.gz

# Splitting the file for distribution
$ gunzip ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.json.gz
$ split -n l/10 --numeric-suffixes=1 --additional-suffix=.json ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.json ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.
$ gzip ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.*.json
$ gzip ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.json

The script outputs a gzipped JSON Lines file containing the following items, one item per line:

{
  "title": "アンパサンド",
  "pageid": 5,
  "revid": 99347164,
  "url": "https://ja.wikipedia.org/api/rest_v1/page/html/%E3%82%A2%E3%83%B3%E3%83%91%E3%82%B5%E3%83%B3%E3%83%89/99347164",
  "html": "<!DOCTYPE html>\n<html prefix=\"dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/\" ..."
}
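
Since this file can be large, it is convenient to stream it one record at a time instead of loading it all at once. The following is a minimal sketch (not part of the repository), with a placeholder file path.

import gzip
import json

# Minimal sketch: stream the gzipped JSON Lines file record by record.
with gzip.open("page-htmls-jawiki-20240401.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)
        print(page["pageid"], page["title"], len(page["html"]))
        break  # remove this to process all pages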

Extract paragraphs from the Wikipedia page HTMLs

This script extracts paragraph texts from a Wikipedia page HTMLs file generated by get_page_htmls.py. You can specify the minimum and maximum length of the paragraph texts to be extracted.

# This produces 10,144,171 paragraphs
$ python extract_paragraphs_from_page_htmls.py \
--page_htmls_file ~/work/wikipedia-utils/20240401/page-htmls-jawiki-20240401.json.gz \
--output_file ~/work/wikipedia-utils/20240401/paragraphs-jawiki-20240401.json.gz \
--min_paragraph_length 10 \
--max_paragraph_length 1000
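
The rough idea of the extraction step can be illustrated as follows. This is a simplified sketch, not the logic of extract_paragraphs_from_page_htmls.py itself; it assumes BeautifulSoup is available and only looks at <p> elements.

from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Simplified sketch, not the actual script: pull paragraph texts from the
# HTML of one page and apply the length filtering.
def extract_paragraphs(html, min_length=10, max_length=1000):
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if min_length <= len(text) <= max_length:
            yield text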

Make a plain text corpus of Wikipedia paragraph/page texts

This script produces a plain text corpus file from a paragraphs file generated by extract_paragraphs_from_page_htmls.py. You can optionally filter out disambiguation/sexual/violent pages from the output file by specifying the corresponding command line options.

Here we use mecab-ipadic-NEologd when splitting texts into sentences so that named entities are not mistakenly split across sentence boundaries.

The output file is a gzipped text file containing one sentence per line, with the pages separated by blank lines.

# 25,529,795 lines from all pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20240401/paragraphs-jawiki-20240401.json.gz \
--output_file ~/work/wikipedia-utils/20240401/corpus-jawiki-20240401.txt.gz \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000

# 20,555,941 lines from filtered pages
$ python make_corpus_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20240401/paragraphs-jawiki-20240401.json.gz \
--output_file ~/work/wikipedia-utils/20240401/corpus-jawiki-20240401-filtered-large.txt.gz \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7' \
--min_sentence_length 10 \
--max_sentence_length 1000 \
--page_ids_file ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json \
--min_inlinks 10 \
--exclude_sexual_pages
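
Since the output has one sentence per line and a blank line between pages, it can be read back page by page as in the following minimal sketch (not part of the repository); the file path is a placeholder.

import gzip

# Minimal sketch: iterate over the corpus page by page. Pages are separated
# by blank lines, and each non-blank line is one sentence.
def iter_pages(path):
    sentences = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                sentences.append(line)
            elif sentences:
                yield sentences
                sentences = []
    if sentences:
        yield sentences

for sentences in iter_pages("corpus-jawiki-20240401.txt.gz"):
    print("first page has", len(sentences), "sentences")
    break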

The script make_corpus_from_cirrussearch.py produces a plain text corpus file by simply taking the text attribute of each page from a Wikipedia Cirrussearch dump file.

The resulting corpus file will be somewhat different from the one generated by make_corpus_from_paragraphs.py due to differences in text processing. In addition, since the text attribute in the Cirrussearch dump file does not retain the page structure, it offers less flexibility in how the text is processed than working on the HTML files with make_corpus_from_paragraphs.py.

$ python make_corpus_from_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20240401/jawiki-20240401-cirrussearch-content.json.gz \
--output_file ~/work/wikipedia-utils/20240401/corpus-jawiki-20240401-cirrus.txt.gz \
--min_inlinks 10 \
--exclude_sexual_pages \
--mecab_option '-d /usr/local/lib/mecab/dic/ipadic-neologd-v0.0.7'

Make a passages file from extracted paragraphs

This script takes a paragraphs file generated by extract_paragraphs_from_page_htmls.py and splits the paragraph texts into a collection of smaller text units called passages, using sections, paragraphs, or sentences as the unit.

It is useful for creating texts of a reasonable length that can be handled by passage-retrieval systems such as DPR.

# Construct each passage from a paragraph not exceeding 400 chars
# 9,856,972 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20240401/paragraphs-jawiki-20240401.json.gz \
--output_file ~/work/wikipedia-utils/20240401/passages-para-jawiki-20240401.json.gz \
--passage_unit paragraph \
--passage_boundary section \
--max_passage_length 400

# Construct passages from consecutive sentences within a section
# The sentences are joined to form a passage not exceeding 400 chars
# 5,807,053 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20240401/paragraphs-jawiki-20240401.json.gz \
--output_file ~/work/wikipedia-utils/20240401/passages-c400-jawiki-20240401.json.gz \
--passage_unit sentence \
--passage_boundary section \
--max_passage_length 400 \
--as_long_as_possible

# Construct passages from consecutive sentences within a section
# The sentences are joined to form a passage not exceeding 300 chars
# 6,947,948 passages
$ python make_passages_from_paragraphs.py \
--paragraphs_file ~/work/wikipedia-utils/20240401/paragraphs-jawiki-20240401.json.gz \
--output_file ~/work/wikipedia-utils/20240401/passages-c300-jawiki-20240401.json.gz \
--passage_unit sentence \
--passage_boundary section \
--max_passage_length 300 \
--as_long_as_possible
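
Conceptually, the sentence-unit passages are built by greedily joining consecutive sentences until the length limit would be exceeded, roughly as in the sketch below. This illustrates the idea described above and is not the script's actual implementation.

# Conceptual sketch, not the script's actual implementation: greedily join
# consecutive sentences of one section into passages of at most
# max_passage_length characters.
def join_sentences(sentences, max_passage_length=400):
    passages, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_passage_length:
            passages.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        passages.append(current)
    return passages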

Build Elasticsearch indices of Wikipedia passages/pages

Requirements

  • Elasticsearch 7.x with several plugins installed
# For running build_es_index_passages.py
$ ./bin/elasticsearch-plugin install analysis-icu
$ ./bin/elasticsearch-plugin install analysis-kuromoji

# For running build_es_index_cirrussearch.py (Elasticsearch 7.10.2 is needed)
# See https://mvnrepository.com/artifact/org.wikimedia.search/extra for alternative versions
$ ./bin/elasticsearch-plugin install org.wikimedia.search:extra:7.10.2-wmf1

The script build_es_index_passages.py builds an Elasticsearch index of passages generated by make_passages_from_paragraphs.py.

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20240401/passages-para-jawiki-20240401.json.gz \
--page_ids_file ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json \
--index_name jawiki-20240401-para

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20240401/passages-c400-jawiki-20240401.json.gz \
--page_ids_file ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json \
--index_name jawiki-20240401-c400

$ python build_es_index_passages.py \
--passages_file ~/work/wikipedia-utils/20240401/passages-c300-jawiki-20240401.json.gz \
--page_ids_file ~/work/wikipedia-utils/20240401/page-ids-jawiki-20240401.json \
--index_name jawiki-20240401-c300
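
Once an index has been built, it can be queried with the official Elasticsearch Python client, as in the illustrative sketch below. The field name "text" is an assumption made for the example and may differ from the actual index mapping.

from elasticsearch import Elasticsearch  # pip install "elasticsearch>=7,<8"

# Illustrative sketch: run a full-text query against one of the passage
# indices built above. The field name "text" is an assumption and may
# differ from the actual mapping.
es = Elasticsearch("http://localhost:9200")
response = es.search(
    index="jawiki-20240401-para",
    body={"query": {"match": {"text": "アンパサンド"}}, "size": 5},
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))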

The script build_es_index_cirrussearch.py builds an Elasticsearch index of Wikipedia pages from a Cirrussearch dump file. Cirrussearch dump files are originally intended for Elasticsearch bulk indexing, so the script simply takes the page information from the dump file to build an index.

$ python build_es_index_cirrussearch.py \
--cirrus_file ~/data/wikipedia/cirrussearch/20240401/jawiki-20240401-cirrussearch-content.json.gz \
--index_name jawiki-20240401-cirrus \
--language ja
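
For reference, a Cirrussearch content dump can be read in Python as pairs of lines (an action line followed by a document line), since it is in the Elasticsearch bulk format mentioned above. The snippet below is a minimal sketch, not part of the repository; the file path is a placeholder and the field names are only examples of what the documents contain.

import gzip
import json

# Minimal sketch (not part of this repository): read a Cirrussearch content
# dump, which alternates between bulk action lines and document lines.
path = "jawiki-20240401-cirrussearch-content.json.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    for action_line, doc_line in zip(f, f):
        action = json.loads(action_line)
        doc = json.loads(doc_line)
        print(action["index"]["_id"], doc.get("title"))
        break  # remove this to iterate over the whole dump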

License

The content of Wikipedia, which can be obtained with the code in this repository, is licensed under the CC-BY-SA 3.0 and GFDL licenses.

The code in this repository is licensed under the Apache License 2.0.
