A tool for extracting plain text from Wikipedia dumps

Overview

WikiExtractor

WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.

The tool is written in Python and requires Python 3 but no additional libraries. Warning: problems have been reported on Windows due to poor support for StringIO in the Python implementation for that platform.

For further information, see the Wiki.

Wikipedia Cirrus Extractor

cirrus-extractor.py is a version of the script that performs extraction from a Wikipedia Cirrus dump. Cirrus dumps contain text with templates already expanded.

Cirrus dumps are available at: cirrussearch.
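As a rough illustration (not part of cirrus-extractor.py itself), a Cirrus dump can be read as a gzip-compressed stream of JSON lines in which metadata lines alternate with page content lines; the exact field names may vary between dump versions, and the file name below is only a placeholder:

import gzip
import json

def iter_cirrus_docs(path):
    # Assumed layout: an "index" metadata line is followed by a content
    # line carrying the page fields (title, text, ...).
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if "index" in record:          # skip the metadata line
                continue
            yield record.get("title", ""), record.get("text", "")

# Placeholder file name; use the dump you actually downloaded.
for title, text in iter_cirrus_docs("enwiki-latest-cirrussearch-content.json.gz"):
    print(title, len(text))
    break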

Details

WikiExtractor performs template expansion by preprocessing the whole dump and extracting template definitions.

In order to speed up processing:

  • multiprocessing is used for dealing with articles in parallel
  • a cache of parsed templates is kept (only useful for repeated extractions).
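Both ideas can be sketched as follows; the names and the toy {{...}} matcher are hypothetical and do not reflect the script's actual internals:

import re
from multiprocessing import Pool

_template_cache = {}                       # template name -> cached expansion

def expand_template(name):
    # Parse/expand each template only once; later uses hit the cache.
    if name not in _template_cache:
        _template_cache[name] = ""         # stand-in for real template expansion
    return _template_cache[name]

def clean_page(text):
    # Stand-in for the real cleaning: replace {{...}} uses with their
    # (cached) expansion and strip surrounding whitespace.
    return re.sub(r"{{(.*?)}}", lambda m: expand_template(m.group(1)), text).strip()

def extract_parallel(pages, processes=4):
    # Articles are distributed over a pool of worker processes.
    with Pool(processes) as pool:
        return pool.map(clean_page, pages)

if __name__ == "__main__":
    print(extract_parallel(["Some text {{citation needed}} here.", "Another page."]))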

Installation

The script may be invoked directly:

python -m wikiextractor.WikiExtractor <Wikipedia dump file>

It can also be installed from PyPI with:

pip install wikiextractor

or locally with:

(sudo) python setup.py install

The installer also installs two scripts for direct invocation:

wikiextractor  	(equivalent to python -m wikiextractor.WikiExtractor)
extractPage		(to extract a single page from a dump)

Usage

Wikiextractor

The script is invoked with a Wikipedia dump file as an argument:

python -m wikiextractor.WikiExtractor <Wikipedia dump file> [--templates <extracted template file>]

The option --templates extracts the templates to a local file, which can be reloaded to reduce the time to perform extraction.

The output is stored in several files of similar size in a given directory. Each file contains several documents in the document format described below.
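For example, the extracted files can be split back into documents with a few lines of Python; this sketch assumes the default uncompressed <doc id="..." url="..." title="">...</doc> layout and uses a placeholder output directory:

import os
import re

# Matches one <doc ...> ... </doc> block and captures id, url, title and body.
DOC_RE = re.compile(
    r'<doc id="(.*?)" url="(.*?)" title="(.*?)">\n(.*?)</doc>', re.DOTALL)

def iter_documents(output_dir):
    # Walk the AA, AB, ... subdirectories and yield one tuple per document.
    for root, _dirs, files in os.walk(output_dir):
        for name in sorted(files):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for match in DOC_RE.finditer(f.read()):
                    yield match.groups()

for doc_id, url, title, text in iter_documents("extracted"):
    print(doc_id, title)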

usage: wikiextractor [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html] [-l] [-ns ns1,ns2]
			 [--templates TEMPLATES] [--no-templates] [--html-safe HTML_SAFE] [--processes PROCESSES]
			 [-q] [--debug] [-a] [-v]
			 input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

	<doc id="" url="" title="">
	    ...
	    </doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as JSON objects, one per line, with
the following structure:

	{"id": "", "revid": "", "url": "", "title": "", "text": "..."}

The program performs template expansion by preprocessing the whole dump and
collecting template definitions.

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --processes PROCESSES
			    Number of processes to use (default 79)

Output:
  -o OUTPUT, --output OUTPUT
			    directory for extracted files (or '-' for dumping to stdout)
  -b n[KMG], --bytes n[KMG]
			    maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip
  --json                write output in json format instead of the default <doc> format

Processing:
  --html                produce HTML output, subsumes --links
  -l, --links           preserve links
  -ns ns1,ns2, --namespaces ns1,ns2
			    accepted namespaces
  --templates TEMPLATES
			    use or create file containing templates
  --no-templates        Do not expand templates
  --html-safe HTML_SAFE
			    use to produce HTML safe output within <doc>...</doc>

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug option)
  -v, --version         print program version
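When --json is used, each document is a single JSON object on its own line, so the output files can be read line by line; this is a minimal sketch with a placeholder file path:

import json

def iter_json_documents(path):
    # Each non-empty line of a --json output file is one document object.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

for doc in iter_json_documents("extracted/AA/wiki_00"):
    print(doc["id"], doc["title"], len(doc["text"]))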

Saving templates to a file will speed up performing extraction the next time, assuming template definitions have not changed.

Option --no-templates significantly speeds up the extractor, avoiding the cost of expanding MediaWiki templates.

For further information, visit the documentation.

Cirrus Extractor

usage: cirrus-extract.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [-ns ns1,ns2] [-q]
                         [-v]
                         input

Wikipedia Cirrus Extractor:
Extracts and cleans text from a Wikipedia Cirrus dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

	<doc id="" url="" title="" language="" revision="">
        ...
        </doc>

positional arguments:
  input                 Cirrus Json wiki dump file

optional arguments:
  -h, --help            show this help message and exit

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to
                        stdout)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip

Processing:
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces

Special:
  -q, --quiet           suppress reporting progress info
  -v, --version         print program version

extractPage

Extract a single page from a Wikipedia dump file.

usage: extractPage [-h] [--id ID] [--template] [-v] input

Wikipedia Page Extractor:
Extracts a single page from a Wikipedia dump file.

positional arguments:
  input          XML wiki dump file

optional arguments:
  -h, --help     show this help message and exit
  --id ID        article number
  --template     template number
  -v, --version  print program version

License

The code is made available under the GNU Affero General Public License v3.0.

Reference

If you find this code useful, please cite it in publications as:

@misc{Wikiextractor2015,
  author = {Giuseppe Attardi},
  title = {WikiExtractor},
  year = {2015},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/attardi/wikiextractor}}
}
Comments
  •  'maximum template recursion' error after a few hours

    Can you explain why this error occurs? I used the updated version of the script uploaded yesterday. Now it's giving this error.

    Traceback (most recent call last):
      File "./WikiExtractor.py", line 1797, in <module>
        main()
      File "./WikiExtractor.py", line 1793, in main
        process_data(input_file, args.templates, output_splitter)
      File "./WikiExtractor.py", line 1621, in process_data
        extract(id, title, page, output)
      File "./WikiExtractor.py", line 132, in extract
        text = clean(text)
      File "./WikiExtractor.py", line 1256, in clean
        text = expandTemplates(text)
      File "./WikiExtractor.py", line 307, in expandTemplates
        res += expandTemplate(text[s+2:e-2], depth+l)
      File "./WikiExtractor.py", line 808, in expandTemplate
        ret = expandTemplates(template, depth + 1)
      File "./WikiExtractor.py", line 307, in expandTemplates
        res += expandTemplate(text[s+2:e-2], depth+l)
      File "./WikiExtractor.py", line 769, in expandTemplate
        params = templateParams(parts[1:], depth)
      File "./WikiExtractor.py", line 396, in templateParams
        parameters = [expandTemplates(p, frame) for p in parameters]
      File "./WikiExtractor.py", line 307, in expandTemplates
        res += expandTemplate(text[s+2:e-2], depth+l)
      File "./WikiExtractor.py", line 769, in expandTemplate
        params = templateParams(parts[1:], depth)
      File "./WikiExtractor.py", line 396, in templateParams
        parameters = [expandTemplates(p, frame) for p in parameters]
      File "./WikiExtractor.py", line 307, in expandTemplates
        res += expandTemplate(text[s+2:e-2], depth+l)
      File "./WikiExtractor.py", line 808, in expandTemplate
        ret = expandTemplates(template, depth + 1)
      File "./WikiExtractor.py", line 313, in expandTemplates
        res += text[cur:]
    MemoryError

    opened by agoyaliitk 15
  • Infinite recursion

    Trying to parse the enwiki-20150304-pages-articles.xml.bz2 dump causes infinite recursion:

    Traceback (most recent call last):
      File "/usr/lib/python2.7/logging/__init__.py", line 851, in emit
    Traceback (most recent call last):
      File "./WikiExtractor.py", line 1708, in <module>
        main()
      File "./WikiExtractor.py", line 1704, in main
        process_data(input_file, args.templates, output_splitter)
      File "./WikiExtractor.py", line 1537, in process_data
        extract(id, title, page, output)
      File "./WikiExtractor.py", line 132, in extract
        text = clean(text)
      File "./WikiExtractor.py", line 1172, in clean
        text = expandTemplates(text)
      File "./WikiExtractor.py", line 317, in expandTemplates
        res += expandTemplate(text[s+2:e-2], frame)
      File "./WikiExtractor.py", line 730, in expandTemplate
        ret =  expandTemplates(template, frame)
      File "./WikiExtractor.py", line 317, in expandTemplates
        res += expandTemplate(text[s+2:e-2], frame)
      File "./WikiExtractor.py", line 699, in expandTemplate
        params = templateParams(parts[1:])
      File "./WikiExtractor.py", line 406, in templateParams
        parameters[i] = expandTemplates(p)
      File "./WikiExtractor.py", line 317, in expandTemplates
        res += expandTemplate(text[s+2:e-2], frame)
      File "./WikiExtractor.py", line 730, in expandTemplate
        ret =  expandTemplates(template, frame)
      File "./WikiExtractor.py", line 317, in expandTemplates
        res += expandTemplate(text[s+2:e-2], frame)
      File "./WikiExtractor.py", line 699, in expandTemplate
        params = templateParams(parts[1:])
      File "./WikiExtractor.py", line 406, in templateParams
        parameters[i] = expandTemplates(p)
      ...
      File "./WikiExtractor.py", line 317, in expandTemplates
        res += expandTemplate(text[s+2:e-2], frame)
      File "./WikiExtractor.py", line 699, in expandTemplate
        params = templateParams(parts[1:])
      File "./WikiExtractor.py", line 406, in templateParams
        parameters[i] = expandTemplates(p)
      File "./WikiExtractor.py", line 317, in expandTemplates
        res += expandTemplate(text[s+2:e-2], frame)
      File "./WikiExtractor.py", line 735, in expandTemplate
        + str(maxTemplateRecursionLevels))
      File "/usr/lib/python2.7/logging/__init__.py", line 1604, in warning
        root.warning(msg, *args, **kwargs)
      File "/usr/lib/python2.7/logging/__init__.py", line 1164, in warning
        self._log(WARNING, msg, args, **kwargs)
      File "/usr/lib/python2.7/logging/__init__.py", line 1271, in _log
        self.handle(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 1281, in handle
        self.callHandlers(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 1321, in callHandlers
        hdlr.handle(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 749, in handle
        self.emit(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 879, in emit
        self.handleError(record)
      File "/usr/lib/python2.7/logging/__init__.py", line 802, in handleError
        None, sys.stderr)
      File "/usr/lib/python2.7/traceback.py", line 125, in print_exception
        print_tb(tb, limit, file)
      File "/usr/lib/python2.7/traceback.py", line 69, in print_tb
        line = linecache.getline(filename, lineno, f.f_globals)
      File "/usr/lib/python2.7/linecache.py", line 14, in getline
        lines = getlines(filename, module_globals)
    RuntimeError: maximum recursion depth exceeded
    

    The last log entry is: INFO:root:967 Acute disseminated encephalomyelitis.

    opened by cifkao 15
  • Extracting links apparently broken

    Hello,

    I'm using WikiExtractor for an academic project and I need to extract the pages from WikiNews while keeping the links. My problem is that the script, when called with the -l option, removes links instead of preserving them.

    Take as an example this news item, titled Nobel Peace Prize awarded to Kenyan environmental activist. I download the latest dump, then run the script as follows:

    ~/wikiextractor$ ./WikiExtractor.py -o extractedWithLinks -l enwikinews-latest-pages-meta-current.xml
    

    I look for the file containing the text of the page:

    ~/wikiextractor$ cd extractedWithLinks/
    ~/wikiextractor/extractedWithLinks$ grep -r "Nobel Peace Prize awarded to Kenyan environmental activist" .
    ./AA/wiki_00:<doc id="1637" url="https://en.wikinews.org/wiki?curid=1637" title="Nobel Peace Prize awarded to Kenyan environmental activist">
    ...
    

    If I look at the XML extracted by WikiExtractor it looks like this:

    <doc id="1637" url="https://en.wikinews.org/wiki?curid=1637" title="Nobel Peace Prize awarded to Kenyan environmental activist">
    Nobel Peace Prize awarded to Kenyan environmental activist
    
    Dr Maathai is a member of parliament in Kenya, the country's deputy environmental minister, and holds a in from the University of Nairobi. For seven years she was the director of the in Kenya, and is most known for founding the — a non-governmental organization dedicated to environmental conservation and protecting forests. Since its founding in 1997, the organization claims to have planted over 30 million trees, in the process employing thousands of women — offering them empowerment, education and even family planning.
    
    The GBM organises rural women in Kenya to participate in environmentally friendly activities such as reforestation; economically-conducive activities like eco-tourism and training in forestry and food processing; as well as community development. 
    ...
    </doc>
    

    As you can see, the first sentence of the page is missing:

    OSLO — The 2004 Nobel Peace Prize was awarded today to Dr Wangari Maathai from Kenya. She is the first African woman to win the Peace prize, and the 12th woman to win the prize since its inception in 1901. The Nobel committee cited "her contribution to sustainable development, democracy and peace" as the reasons for awarding the prize. It is the first Peace prize awarded to an environmentalist.

    And some of the links in the following sentences are missing as well. The extracted text is:

    Dr Maathai is a member of parliament in Kenya, the country's deputy environmental minister, and holds a in from the University of Nairobi. For seven years she was the director of the in Kenya, and is most known for founding the — a non-governmental organization dedicated to environmental conservation and protecting forests. ...

    While the original text reads (the missing links are in bold):

    Dr Maathai is a member of parliament in Kenya, the country's deputy environmental minister, and holds a Ph.D. in anatomy from the University of Nairobi. For seven years she was the director of the Red Cross in Kenya, and is most known for founding the Green Belt Movement — a non-governmental organization dedicated to environmental conservation and protecting forests.

    So: am I missing something in the configuration of WikiExtractor? Is it a bug? Are WikiNews dumps for some reason not supported, even if they should be identical in structure to the usual Wikipedia ones?

    opened by basaldella 12
  • NameError: global name 'templatePrefix' is not defined

    What does this error mean and why does it occur?

      File "C:\Users\Crezary Wagner\Documents\GitHub\wikipedia-extractor\WikiExtractor.py", line 1896, in clean
    NameError: global name 'templatePrefix' is not defined

    opened by ChameleonRed 11
  • NameError: global name 'templatePrefix' is not defined

    I encountered the problem after running WikiExtractor.py (with Python 2.7 on Windows 8.1 x64) on a Farsi wiki dump. Can you explain why this error occurs?

    python h:\wiki\WikiExtractor.py h:\wiki\fawiki-20150602-pages-articles.xml.bz2 -cb 5M -o h:\wiki\extracted --processes 1
    INFO: Preprocessing 'h:\wiki\fawiki-20150602-pages-articles.xml.bz2' to collect template definitions: this may take some time.
    INFO: Preprocessed 100000 pages
    INFO: Preprocessed 200000 pages
    INFO: Preprocessed 300000 pages
    INFO: Preprocessed 400000 pages
    INFO: Preprocessed 500000 pages
    INFO: Preprocessed 600000 pages
    INFO: Preprocessed 700000 pages
    INFO: Preprocessed 800000 pages
    INFO: Preprocessed 900000 pages
    INFO: Preprocessed 1000000 pages
    INFO: Preprocessed 1100000 pages
    INFO: Preprocessed 1200000 pages
    INFO: Preprocessed 1300000 pages
    INFO: Preprocessed 1400000 pages
    INFO: Preprocessed 1500000 pages
    INFO: Preprocessed 1600000 pages
    INFO: Preprocessed 1700000 pages
    INFO: Preprocessed 1800000 pages
    INFO: Preprocessed 1900000 pages
    INFO: Preprocessed 2000000 pages
    INFO: Preprocessed 2100000 pages
    INFO: Preprocessed 2200000 pages
    INFO: Loaded 109314 templates in 685.3s
    INFO: Starting page extraction from h:\wiki\fawiki-20150602-pages-articles.xml.bz2.
    INFO: Using 1 extract processes.
    Process Process-2:
    Traceback (most recent call last):
      File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
        self.run()
      File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "h:\wiki\WikiExtractor.py", line 2427, in extract_process
        Extractor(*job[:3]).extract(out)  # (id, title, page)
      File "h:\wiki\WikiExtractor.py", line 423, in extract
        text = clean(self, text)
      File "h:\wiki\WikiExtractor.py", line 1896, in clean
        text = extractor.expandTemplates(text)
      File "h:\wiki\WikiExtractor.py", line 479, in expandTemplates
        res += wikitext[cur:s] + self.expandTemplate(wikitext[s+2:e-2])
      File "h:\wiki\WikiExtractor.py", line 636, in expandTemplate
        title = fullyQualifiedTemplateTitle(title)
      File "h:\wiki\WikiExtractor.py", line 1121, in fullyQualifiedTemplateTitle
        return templatePrefix + ucfirst(templateTitle)
    NameError: global name 'templatePrefix' is not defined

    opened by ehsanasgarian 9
  • Documentation

    Hi there, great tool! It appears there is an error or a lack of detail in the documentation here: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor . Or maybe this is OS-specific (I'm on Ubuntu 16.04 LTS) or depends on the language version of the wiki (although it really shouldn't). I used a line similar to the one shown there on the command line:

    WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2
    

    If I do this, I get the following error message:

    WikiExtractor.py: command not found
    

    I then did this, adding "python" at the beginning, but still using the "*.xml.bz2" file as input file (just with the frwiki dump):

    python WikiExtractor.py frwiki-latest-pages-articles.xml.bz2
    

    With this, I get the following error message:

    INFO: Loaded 0 templates in 0.0s
    INFO: Starting page extraction from frwiki-20160920-pages-articles-multistream.xml.bz2.
    INFO: Using 3 extract processes.
    INFO: Finished 3-process extraction of 0 articles in 0.0s (0.0 art/s)
    
    

    A folder "text" is created but then nothing happens. If I try to use python 3, I get another error message:

    Traceback (most recent call last):
      File "WikiExtractor.py", line 60, in <module>
        from cStringIO import StringIO
    ImportError: No module named 'cStringIO'
    

    However, then I used python 2.7, first extracted the "bz2" archive (using bzip2) and then used the "*.xml" file as input, like this:

    python WikiExtractor.py frwiki-latest-pages-articles.xml
    

    With this, all works just fine. It may be worth telling people this. It drove me mad for a while because I could not figure out what I was doing wrong. Partly, this may just be because I was calling this thing from the command line. But the extraction bit should not depend on this.

    Thanks!

    opened by christofs 8
  • Stopped making progress

    I am processing the dump of 20160305.

    The script ran for about 20 hours and then just stopped making any further progress. I saw two Unix processes but they were both sleeping.

    The last few outputs were:

    WARNING: Template errors in article 'Shawn Matthias' (15299966): title(2) recursion(0, 0, 0)
    WARNING: Template errors in article 'Rainey Street Historic District (Austin, Texas)' (15301930): title(0) recursion(116, 0, 0)
    WARNING: Template errors in article 'Alfred Neuland' (15304281): title(2) recursion(0, 0, 0)
    WARNING: Template errors in article 'Humberto Mariles' (15305453): title(2) recursion(0, 0, 0)
    WARNING: Template errors in article 'Rubén Uriza' (15305737): title(2) recursion(0, 0, 0)
    WARNING: Template errors in article 'Santiago Ramírez' (15306967): title(2) recursion(0, 0, 0)
    
    opened by ghost 8
  • Section names absent

    Hi! Some section names, e.g. 'Bibliografia', are removed. For example, for this person Duino Gorin https://it.wikipedia.org/wiki/Duino_Gorin

    In the XML file I can see this level-2 header: ==Bibliografia== *''La Raccolta Completa degli Album Panini 1975-1976'' *''La Raccolta Completa degli Album Panini 1960-2004'' - Indici *''Almanacco illustrato del calcio 1982''. edizione Pani

    But the processed file has just this (no 'Bibliografia' section):

    Trascorse in rossonero tre stagioni, fino al 1977, quando passò al Monza.

    • "La Raccolta Completa degli Album Panini 1975-1976"
    • "La Raccolta Completa degli Album Panini 1960-2004" - Indici
    • "Almanacco illustrato del calcio 1982". edizione Panini.

    How can I keep section names, please?

    Thanks!

    opened by ghost 8
  • Template expansion does not seem to work for French

    First get the template file TEMPLATES; this requires parsing the whole dump.

    python extractPage.py --id 275 ../frwiki-20150602-pages-articles.xml.bz2 >aikibudo
    python WikiExtractor.py -o extracted --templates ../TEMPLATES -a aikibudo

    I get

    L' est un art martial traditionnel d'origine japonaise ("budō") essentiellement basé sur des techniques de défense.

    Correct sentence:

    L'aïkibudo (合気武道, aikibudō?) est un art martial traditionnel d'origine japonaise (budō) essentiellement basé sur des techniques de défense.

    Wiki text:

    L'{{japonais|'''aïkibudo'''|合気武道|aikibudō}} est un [[art martial]] traditionnel d'origine [[japon]]aise (''[[budō]]'') essentiellement basé sur des techniques de défense.

    opened by aadant 8
  • wiki extractor results directories end up in QN

    I want to get the title and the content of every Wikipedia article. I found the wiki extractor very useful for this purpose, and I used it according to the instructions on GitHub. When running wiki extractor V2.8, I ran into a 'maximum template recursion' error after a few hours. I am getting wiki extractor from this GitHub page: https://github.com/bwbaugh/wikipedia-extractor/blob/master/WikiExtractor.py

    So I tried previous commits/versions: V2.6, V2.5 and V2.4.

    In wiki extractor V2.4, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QH.

    In wiki extractor V2.5, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.

    In wiki extractor V2.6, the program seems to run successfully; it stops after printing "45581241 Kaduthuruthy Thazhathupally" to the terminal, and the resulting directories range from AA to QN.

    But I am really confused, because I have no idea which version has the complete Wikipedia articles. In my understanding, it seems none of them succeeded, because the resulting directory should contain AA to AZ, BA to BZ, ... QA to QZ, RA to RZ ... ZA to ZZ; but in V2.5 and V2.6 it stops at QN.

    Could anyone who has run the wiki extractor successfully please shed some light on this? What should a successful result look like, and which version should I run to get the correct result?

    opened by sylvia1 8
  • Use of global variables

    There seems to be a lot of code in this library that uses global variables that are shared across multiple processes (for instance, the various parameters for tags to ignore, etc.). This is not a good practice and especially causes problems on Windows. It would be better to reorganize the code so that the global state is stored in an object that is passed to each process.

    I encountered this in the form of repeated NameError: global name 'discardElements' is not defined errors. This appears to be due to the recent PR #102 that moved the definition of this element inside a function, so it's no longer defined globally.

    opened by BrenBarn 7
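
    For illustration, the reorganization suggested in the issue above might look roughly like the sketch below; the names (ExtractorConfig, clean_page) are hypothetical and this is not a patch against the actual code:

    from dataclasses import dataclass
    from multiprocessing import Pool

    @dataclass(frozen=True)
    class ExtractorConfig:
        # Holds what are currently module-level globals.
        discard_elements: frozenset = frozenset({"gallery", "timeline", "noinclude"})
        keep_links: bool = False

    def clean_page(job):
        config, text = job
        # Real cleaning would consult config.discard_elements, config.keep_links, ...
        return text if config.keep_links else text.strip()

    def extract_parallel(config, pages, processes=2):
        # The config travels with every job, so worker processes never rely on
        # globals that may be missing after a spawn/fork (the failure mode above).
        with Pool(processes) as pool:
            return pool.map(clean_page, [(config, p) for p in pages])

    if __name__ == "__main__":
        print(extract_parallel(ExtractorConfig(), [" some wikitext "]))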
  • How to extract list pages?

    There are many pages which just list other pages, e.g. e1, e2.

    Running "vanilla" extraction omits them entirely while keeping only the title. What do I need to configure to extract those pages? bonus - what can I do in order to extract - only - those list pages?

    opened by katzurik 0
  • Non-textual elements score and mapframe are not filtered out

    Several elements with non-textual content such as maps and musical scores (elements mapframe and score) are not filtered out. Steps to reproduce:

    1. Download this dump: https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
    2. Invoke the following command to list lines that contain the opening tags of these elements:

    wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(score\|mapframe\)\b'

    Output:

    <score sound="1"> % Adding least one space before each line is recommended
    <mapframe latitude="37.7235" longitude="23.4819" zoom="10" width="200" height="131" align="left" />Aegina is roughly triangular in shape, approximately from east to west and from north to south, with an area of .
    <score sound="1">{ \time 4/4 c'4 c' g' g' | a' a' g'2 | f'4 f' e' e' | d' d' c'2 | g'4 g' f' f' | e' e' d'2 | g'4 \times 2/3 { f'8 f' f' } e'4 d' | c' r r2 | \bar "|." } \addlyrics { A B C D E F G H I J K L M N O P Q R S T U V dub- a- U X Y "Z(ed)" }</score>
    <score sound="1">
    <score sound="1">
    <mapframe width=400 height=200 zoom=6 latitude=42.977 longitude=-76.506 align=left text="The original path of the Erie Canal">
    <mapframe latitude="37.807984" longitude="-122.475411" zoom="18" width="213" height="180" align="right">
    

    In general, the output also contains the content delimited by the tags (musical scores and map data). In some cases, one of the opening/closing tags (or part of the score itself) is missing; e.g. article id="152940" from dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles3.xml-p151574p311329.bz2 contains only the opening <score>:

    Sheet music does not often explicitly indicate "Bebung". Composers assumed that, like other ornaments, performers would apply "bebung" at their discretion. Where sheet music does indicate "bebung", it appears as a series of dots above or below a note, with the number of dots indicating the number of finger movements. For example: <score>
    Carl Philipp Emanuel Bach called the vibrato "Bebung", however other composers like Johann Mattheson had described the term earlier on. C.P.E Bach often used Bebung in his 
    

    More often, we see the whole score with the closing tag, but no opening tag.

    There are similar issues with other tags (#300) and table formatting (#298).

    opened by adno 0
  • Various tags such as q, br, ins, del are not filtered out

    Many elements/tags appear in wikiextractor's output, such as poem, q, ins, del, br, section, onlyinclude, includeonly and math, as well as mathematical equations (with commands such as \mathbf) not enclosed in any tags.

    1. Download this dump: https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
    2. Invoke the following command to list lines that contain the opening tags of these elements:

    wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(poem\|q\|section\|ins\|del\|math\|onlyinclude\|br\|chem\)\b'

    Examples from the output:

    <poem>
    <poem style="margin-left:2em">
    <br>"domestic:" good automatic telephone system
    …
    Benzene, <chem>C6H6</chem>, …
    …
    <section end="Big Brother series" />
    …
    <onlyinclude>
    …
    <chem>O2{} + 4H+(aq){} + 4 Fe^{2+}(cyt\,c) -> 2H2O{} + 4 Fe^{3+}(cyt\,c) </chem> formula_1
    …
    </includeonly><section end=Lineups />
    

    (Not all of the tags appear in this particular dump.)

    opened by adno 1
  • Cannot turn off --html-safe command line option (true by default)

    Due to a bug, the only way to turn off the --html-safe command line option is to pass an empty argument (which evaluates as false in Python), like this:

    wikiextractor --html-safe ""

    The following do not work:

    wikiextractor --no-html-safe
    wikiextractor --html-safe false

    The argument is currently defined like this:

    https://github.com/attardi/wikiextractor/blob/f0ca16c3e92983b9094b6f32526992fc3a678f8f/wikiextractor/WikiExtractor.py#L560-L561

    This means that any parameter is converted to a string, which then evaluates as true unless it is empty. One simple way of correctly defining a boolean argument with a default true value would be:

    parser.add_argument("--html-safe", default=True, action=argparse.BooleanOptionalAction,
                            help="use to produce HTML safe output within <doc>...</doc>")
    

    This way the parser would accept both --html-safe and --no-html-safe and also generate appropriate help.

    opened by adno 0
  • Tables are not entirely filtered out

    Many tables (or parts of them) are still in the output.

    Steps to reproduce:

    1. Download this dump: https://dumps.wikimedia.org/jawiki/20221020/jawiki-20221020-pages-articles1.xml-p1p114794.bz2
    2. Invoke the following command to list lines that contain the string "colspan":

    bzcat jawiki-20221020-pages-articles1.xml-p1p114794.bz2 | wikiextractor/WikiExtractor.py --no-templates -o - - | grep colspan

    Output:

    249||24||colspan="2"|-||9||0||258||24
    21||1||1||0||colspan="2"|-||22||1
    12||0||colspan="2"|-||colspan="2"|-||12||0
    4||2||colspan="2"|-||colspan="2"|-||4||2
    !rowspan="2"|通算!!colspan="2"|OFC
    !colspan="4"|FIFA
    !colspan="2" style="background-color:#efefef"|内容
    !colspan="3"|小計
    !colspan="3"|小計
    !colspan="3"|小計
    !colspan="3"|小計
    !colspan="3"|小計
    !colspan="3"|小計
    !colspan="3"|小計
    !colspan="3"|小計
     59 || 26 ||colspan="2"|-||colspan="2"|-|| 59 || 26
    !colspan="4"|日本!!colspan="2"|リーグ戦!!colspan="2"|!!colspan="2"|天皇杯!!colspan="2"|期間通算
    

    [shortened]

    opened by adno 0