This project is part of EleutherAI's quest to create a massive repository of high-quality text data for training language models.

Last update: Dec 13, 2022

Overview

OpenWebText2

Very briefly, OpenWebText2 is a large, filtered dataset of text documents scraped from URLs found in Reddit submissions.
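As a rough illustration of what "filtered URLs from Reddit submissions" could mean in practice, here is a minimal sketch. The record fields (`url`, `score`), the minimum score, and the rule that skips links back to Reddit are all assumptions made for this example; they are not taken from the project's code, so please check the documentation for the actual pipeline.

```python
from urllib.parse import urlsplit

# Hypothetical filter threshold; an assumption for illustration only.
MIN_SCORE = 3


def candidate_urls(submissions, min_score=MIN_SCORE):
    """Return de-duplicated external URLs from submissions that pass the filter.

    `submissions` is an iterable of dicts with hypothetical "url" and "score" keys.
    """
    seen = set()
    kept = []
    for sub in submissions:
        url = sub.get("url", "")
        parts = urlsplit(url)
        # Keep only ordinary web links.
        if parts.scheme not in ("http", "https"):
            continue
        # Skip links that point back to Reddit itself.
        host = (parts.hostname or "").lower()
        if host == "reddit.com" or host.endswith(".reddit.com"):
            continue
        # Drop low-scoring submissions.
        if sub.get("score", 0) < min_score:
            continue
        # Keep the first occurrence of each URL.
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


submissions = [
    {"url": "https://example.com/a", "score": 10},
    {"url": "https://example.com/a", "score": 4},                   # duplicate URL
    {"url": "https://www.reddit.com/r/x/comments/1", "score": 50},  # link back to Reddit
    {"url": "https://example.org/b", "score": 1},                   # below the threshold
]
print(candidate_urls(submissions))
```

Only the first record survives here: the second is a duplicate, the third points back to Reddit, and the fourth falls below the assumed score threshold.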

The plug-and-play version of OpenWebText2 contains:

17,103,059 documents
65.86GB uncompressed text
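To put these two figures in perspective, the average document size can be estimated from them. The short calculation below is only back-of-the-envelope arithmetic; whether "GB" here means 10^9 bytes or 2^30 bytes is not stated, so both readings are shown as assumptions.

```python
DOCUMENTS = 17_103_059
SIZE_GB = 65.86

# Assumption: the unit may be decimal (10**9 bytes) or binary (2**30 bytes).
avg_bytes_decimal = SIZE_GB * 10**9 / DOCUMENTS
avg_bytes_binary = SIZE_GB * 2**30 / DOCUMENTS

print(f"{avg_bytes_decimal:.0f} bytes per document (decimal GB)")
print(f"{avg_bytes_binary:.0f} bytes per document (binary GiB)")
```

Under either reading, the average document comes to roughly 4 kB of uncompressed text.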

Download Dataset / Documentation

For further information please visit our documentation.

Acknowledgements

researcher2 wrote much of this code, with inspiration from, and some straight copying of, the scraping code found here.
sdtblck kindly put together the Colab notebook and performed a chunk of the scraping.
leogao2 provided overall design guidance and lm_dataformat, and performed another chunk of the scraping.
Colaboratory VMs helped us with about 10% of our overall scraping.
The Eye hosts our processed datasets.
Read the Docs hosts our documentation.

Comments

Fixing an issue with sha256 checking

The pushshift.pushshift_to_sqlite method passes its arguments to best_download.download_file in the wrong order, so the code crashes. As a result, the dataset cannot be reproduced without this modification.

opened by ardacihaner
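The kind of bug described in this issue (positional arguments passed in the wrong order to a download helper that also checks a checksum) can be guarded against with keyword-only parameters. The sketch below is not the project's code and does not reproduce the signature of best_download.download_file, which is not shown here. `fetch_checked` and `sha256_of` are hypothetical stand-ins, and a local file copy takes the place of a real download.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def fetch_checked(*, source, destination, expected_sha256):
    """Hypothetical stand-in for a download helper.

    All parameters are keyword-only, so a caller cannot silently swap them.
    """
    shutil.copyfile(source, destination)  # a local copy stands in for the download
    actual = sha256_of(destination)
    if actual != expected_sha256:
        os.remove(destination)
        raise ValueError(f"sha256 mismatch: expected {expected_sha256}, got {actual}")
    return destination


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "dump.txt")
    dst = os.path.join(tmp, "copy.txt")
    with open(src, "wb") as f:
        f.write(b"example payload")
    checksum = hashlib.sha256(b"example payload").hexdigest()

    # Keyword call: argument order no longer matters.
    fetch_checked(source=src, destination=dst, expected_sha256=checksum)
    copied_ok = os.path.exists(dst)

    # Positional call: rejected immediately instead of misassigning arguments.
    try:
        fetch_checked(src, dst, checksum)
    except TypeError:
        positional_rejected = True
    else:
        positional_rejected = False

print(copied_ok, positional_rejected)
```

With keyword-only parameters, a misordered call fails immediately with a `TypeError` at the call site, rather than later with a confusing checksum or path error.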

Releases (v1.0)

v1.0 (Aug 29, 2021)

Initial Release.
Source code (tar.gz)
Source code (zip)

Owner

EleutherAI

GitHub Repository
