Open clone of OpenAI's unreleased WebText dataset scraper.

Last update: Dec 30, 2022

OpenWebText

Joshua Peterson, Stephan Meylan, & David Bourgin

An open clone of the scraper behind OpenAI's unreleased WebText dataset (blog, paper, code), which was used to train GPT-2. The current result is just over 23 million URLs and over 10 million HTML pages.

This implementation mines and intelligently de-duplicates +3 karma URLs from pre-downloaded (monthly) pushshift.io Reddit submission dumps (which is much faster than making successive calls to the web API), downloads raw HTML, and extracts text. To save time, you can use the pre-filtered URL lists here, which reduce the 140GB of pushshift data down to the 2GB of URLs actually needed for content scraping. There's also an initial utility for tokenizing, and we are looking to add BPE encoding soon. This code base is functional but in active development, so please feel free to post issues or suggest improvements (pull requests welcome).

Dependencies

If you use pipenv (pip install --user pipenv), cd to the project root and run

pipenv install 
pipenv shell

Otherwise, just run the following in a new virtual environment

pip3 install -r requirements.txt

To Extract/Clean URLs Yourself

You can download the pre-filtered URLs here, but if you want to re-filter them yourself, perhaps with different filtering criteria, follow these instructions. Pushshift dumps must first be downloaded using fetch_urls.py (thanks to simonfall), or manually from here. Two example dumps are included in the repo in the "pushshift_dumps" folder. Next, extract good URLs using:

python extract_urls.py --single_file RS_v2_2005-06.xz

To process multiple pushshift files, specify year ranges:

python extract_urls.py --year_start 2016 --year_end 2018

To change the karma threshold:

python extract_urls.py --single_file RS_v2_2005-06.xz --min_karma 4
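To make the karma filter concrete, here is a minimal sketch of the idea, not the code in extract_urls.py. It assumes each line of a dump is one JSON submission with `url` and `score` fields, and that the threshold is inclusive. The function names are hypothetical.

```python
import json
import lzma


def extract_urls(lines, min_karma=3):
    """Yield URLs from newline-delimited JSON submissions scoring >= min_karma."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records rather than abort the whole dump
        url = post.get("url") or ""
        score = post.get("score")
        # Assumption: "score" holds the karma and the threshold is inclusive.
        if isinstance(url, str) and isinstance(score, int):
            if score >= min_karma and url.startswith("http"):
                yield url


def extract_urls_from_xz(path, min_karma=3):
    """Stream an .xz dump from disk without decompressing it fully in memory."""
    with lzma.open(path, mode="rt", encoding="utf-8") as fh:
        yield from extract_urls(fh, min_karma)
```

Streaming line by line keeps memory use flat even for large monthly dumps.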

To de-duplicate the extracted URLs, provide a directory of all URL dumps:

python deduplicate_urls.py --input_dir url_dumps

Both extract_urls.py and deduplicate_urls.py output plain text files, since all 23 million "good" URLs comprise only 2GB.
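As a rough illustration of de-duplication across a directory of URL dumps, the sketch below keeps the first occurrence of each exact URL. This is an assumption for illustration only: deduplicate_urls.py may normalize or compare URLs differently, and the `*.txt` file pattern and function names are hypothetical.

```python
from pathlib import Path


def deduplicate(urls):
    """Yield each non-empty URL once, in first-seen order."""
    seen = set()
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            yield url


def deduplicate_dir(input_dir):
    """De-duplicate across every *.txt file in input_dir (hypothetical layout)."""
    def all_lines():
        for path in sorted(Path(input_dir).glob("*.txt")):
            with path.open(encoding="utf-8") as fh:
                yield from fh
    return list(deduplicate(all_lines()))
```

Processing files in sorted order makes the result reproducible: a URL is attributed to the earliest file in which it appears.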

To Scrape HTML (or Text Directly)

This is done one month at a time given the compute/bandwidth required. n_procs is the number of cores to use for parallelization and should be at least 20-40 for fastest results. The script will output results in chunks of size chunk_size. If timeout is not set, or is set to -1, the downloader may hang on large files.
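The chunking and parallelism described above can be sketched as follows. This is an illustration under assumptions, not the logic of download.py: it uses a thread pool as a stand-in for whatever worker model the script uses, and `fetch` is a hypothetical callable that downloads one URL.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, chunk_size):
    """Split a list into consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process_in_chunks(urls, fetch, n_procs=4, chunk_size=100):
    """Apply fetch to every URL in parallel, yielding one result list per chunk."""
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        for chunk in chunked(urls, chunk_size):
            # pool.map preserves input order within the chunk.
            yield list(pool.map(fetch, chunk))
```

Yielding one list per chunk lets the caller write each chunk to disk before the next one starts, so a crash loses at most one chunk of work.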

To scrape raw HTML for later processing and text extraction, set --scraper to raw as shown below. The downloaded HTML is stripped of script/style tags and stored in compressed archives using LZMA compression, along with a small amount of metadata.

python download.py url_dumps_deduped/RS_20XX-XX.xz.deduped.txt --n_procs 100 --scraper raw --chunk_size 100000 --compress --timeout 30
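For intuition, the raw scraping step can be approximated with the standard library alone. The sketch below is not the repo's implementation: the regex-based tag stripping, the JSON record layout, and all function names are assumptions made for illustration.

```python
import json
import lzma
import re
import urllib.request

# Matches a whole <script>...</script> or <style>...</style> element.
_SCRIPT_STYLE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def fetch_html(url, timeout=30):
    """Download a page, giving up after `timeout` seconds instead of hanging."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def strip_script_style(html):
    """Remove script and style elements, keeping the rest of the markup."""
    return _SCRIPT_STYLE.sub("", html)


def pack_page(url, html):
    """LZMA-compress the cleaned HTML with its URL (hypothetical record layout)."""
    record = {"url": url, "html": strip_script_style(html)}
    return lzma.compress(json.dumps(record).encode("utf-8"))


def unpack_page(blob):
    """Inverse of pack_page."""
    return json.loads(lzma.decompress(blob).decode("utf-8"))
```

A regex is enough for a sketch, but an HTML parser is more robust against malformed markup. Keeping the URL next to the HTML is what makes later re-extraction with different parameters possible.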

To scrape text content directly and save disk space (but without the option to re-extract with different parameters later), set --scraper to newspaper to extract text using the Python newspaper package. For more careful extraction, set --scraper to bs4 (Beautiful Soup 4), which will extract text for all <p> tags on the page.
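To show what tag-based text extraction involves, here is a small sketch that collects the text of paragraph elements. It assumes the tags in question are `<p>` tags, and it uses the standard library's HTMLParser as a stand-in for Beautiful Soup so that it runs without extra dependencies. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text content of every <p> element, one string per paragraph."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_p = False
        self.parts = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an unclosed <p> ends where the next one begins
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)

    def flush(self):
        # Join nested inline text and collapse runs of whitespace.
        text = " ".join("".join(self.parts).split())
        if text:
            self.paragraphs.append(text)
        self.parts = []


def extract_paragraphs(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs
```

With Beautiful Soup itself, a roughly equivalent result comes from calling `find_all("p")` on the parsed document and `get_text()` on each match.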

To Extract Text from HTML (After Download)

python extract_text.py --html_archive scraped/RS_20XX-XX-X_data.xz --n_procs 100

This currently uses newspaper and outputs txt files.

Tokenization

The original WebText didn't use tokenization, but if you need it, use:

python tokenize_text.py --input_glob "parsed/*.txt" --output_dir tokenized

This will be improved and parallelized soon.
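For readers who want a feel for what this step produces, the sketch below tokenizes every file matching a glob and writes space-separated tokens to an output directory. The regex tokenizer is an assumption chosen for brevity; it is not necessarily what tokenize_text.py uses, and the function names are hypothetical.

```python
import glob
import os
import re

# Words (letters, digits, underscore) or single punctuation characters.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return _TOKEN.findall(text)


def tokenize_files(input_glob, output_dir):
    """Write one tokenized file per input file, with tokens separated by spaces."""
    os.makedirs(output_dir, exist_ok=True)
    for path in sorted(glob.glob(input_glob)):
        with open(path, encoding="utf-8") as fh:
            tokens = tokenize(fh.read())
        out_path = os.path.join(output_dir, os.path.basename(path))
        with open(out_path, "w", encoding="utf-8") as fh:
            fh.write(" ".join(tokens))
```

Each input file is independent of the others, so this loop is a natural candidate for parallelizing across processes.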

BPE Encoding

Coming soon...

Original OpenAI project links

Blog Post (Better Language Models and Their Implications)
Paper (Language Models are Unsupervised Multitask Learners)
Code (https://github.com/openai/gpt-2)

Other Implementations

An alternative scraper, based on the pushshift.io API and a fork of the download code above, can be found here.

Owner

Joshua C Peterson
