A collection of Korean text datasets, ready to use with TensorFlow Datasets.

Last update: Jul 11, 2022

Overview

tfds-korean

A collection of Korean (Hangul) text datasets, ready to use with TensorFlow Datasets.

Dataset Catalog | PyPI

Usage

Installation

pip install tfds-korean

Loading dataset

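import tensorflow_datasets as tfds
import tfds_korean.nsmc  # importing the module registers the nsmc dataset with TFDS

# tfds.load returns a dict of tf.data.Dataset objects keyed by split name
ds = tfds.load('nsmc')

train_ds = ds['train'].batch(32)
test_ds = ds['test'].batch(128)

# define and compile your model here
# ....
# ....

model.fit(train_ds)
model.evaluate(test_ds)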

See the Dataset Catalog page for the list of datasets and the details of each one.

Examples

Licenses

The license for this repository and the licenses for the datasets apply separately. Before using a dataset, check its license and its website. Note that this library does not host or distribute the datasets.

Comments

[Dataset Request] sae4k
Dataset Information

Dataset Name:

Preferred code name (e.g. korean_chatbot_qa_data): sae4k

Dataset description:

Homepage: https://github.com/warnikchow/sae4k

Citation:

Additional Context
dataset request
opened by jeongukjae 2
[Dataset Request] namuwiki corpus
Dataset Information

Dataset Name: namuwiki corpus

Preferred code name (e.g. korean_chatbot_qa_data):

Dataset description:

Homepage: https://github.com/jeongukjae/namuwiki-corpus

Citation:

License:

Additional Context

A Namuwiki corpus that has already been split into sentences.
dataset request
opened by jeongukjae 1
[Dataset Request] korean wikipedia corpus
Dataset Information

Dataset Name: Korean Wikipedia corpus

Preferred code name (e.g. korean_chatbot_qa_data): korean_wikipedia_corpus

Dataset description:

Homepage: https://github.com/jeongukjae/korean-wikipedia-corpus

Citation:

License:

Additional Context

kowikitext is already good enough, but it is inconvenient to use at the sentence level. So I generated a corpus from a Korean Wikipedia dump that is already split into sentences (segmented with kss).

FeaturesDict({ 'content': Sequence(Text(shape=(), dtype=tf.string)), 'title': Text(shape=(), dtype=tf.string), })

If content is exposed this way, as a tensor with TensorSpec(shape=[None], dtype=tf.string), it should be convenient for distillation or sentence-level unsupervised learning.
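
As a rough illustration of why a per-document list of sentences is convenient, here is a minimal plain-Python sketch that flattens documents shaped like the feature structure above into a stream of sentences. The Document type, the iter_sentences helper, and the sample documents are hypothetical and not part of tfds-korean; the real dataset yields tensors rather than Python lists.

```python
from typing import Iterable, Iterator, List, TypedDict


class Document(TypedDict):
    """Plain-Python stand-in for one example: a title and a list of sentences."""
    title: str
    content: List[str]


def iter_sentences(documents: Iterable[Document]) -> Iterator[str]:
    """Yield every sentence of every document, in order."""
    for document in documents:
        for sentence in document["content"]:
            yield sentence


# Hypothetical documents that mimic the feature structure above.
docs: List[Document] = [
    {"title": "doc-a", "content": ["first sentence.", "second sentence."]},
    {"title": "doc-b", "content": ["only sentence."]},
]

sentences = list(iter_sentences(docs))
```

With an actual tf.data.Dataset, a similar flattening can probably be written with flat_map and tf.data.Dataset.from_tensor_slices over the content field, though that is not shown or tested here.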
dataset request before-release
opened by jeongukjae 1
[Dataset Request] KLUE
Dataset Information

Dataset Name: KLUE

Preferred code name (e.g. korean_chatbot_qa_data): klue_dp, klue_mrc, ...

Dataset description:

Homepage:

Citation:

License:

Additional Context

https://github.com/KLUE-benchmark/KLUE https://arxiv.org/pdf/2105.09680v1.pdf

[x] dp @jeongukjae

[x] mrc @harrydrippin

[x] ner @jeongukjae

[x] nli @jeongukjae

[x] re @jeongukjae

[x] sts @jeongukjae

[x] wos @jeongukjae

[x] ynat @jeongukjae

dataset request before-release
opened by jeongukjae 1
[Dataset Request] namuwikitext
Dataset Information

Dataset Name: Wikitext format dataset of Namuwiki

Preferred code name (e.g. korean_chatbot_qa_data): namuwikitext

Dataset description: A wikitext-format text file built from the Namuwiki dump data. For training and evaluation, it is split by wiki page into train (99%), dev (0.5%), and test (0.5%).

Homepage: https://github.com/lovit/namuwikitext

Citation:

Additional Context

https://github.com/lovit/namuwikitext/issues/10

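The number of examples does not match the counts in the README, so I opened the issue above, but there has been no reply so far. For now it is probably best to add the dataset the way Korpora does and revise it later.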
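
For readers unfamiliar with page-level splits like the 99% / 0.5% / 0.5% split in the dataset description, the sketch below shows one common way to assign whole pages to splits deterministically by hashing the page title. This is an assumption for illustration only: it is not how namuwikitext was actually split, and assign_split and the page titles are hypothetical.

```python
import hashlib
from collections import Counter


def assign_split(page_title: str) -> str:
    """Map a page title to 'train', 'dev', or 'test' with roughly 99% / 0.5% / 0.5% odds."""
    digest = hashlib.sha256(page_title.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 1000  # value in 0..999
    if bucket < 990:
        return "train"
    if bucket < 995:
        return "dev"
    return "test"


# Hypothetical page titles, used only to show the resulting proportions.
counts = Counter(assign_split(f"page-{i}") for i in range(100_000))
```

Because the assignment depends only on the title, every line of a page lands in the same split, and rerunning the script gives the same result.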
dataset request
opened by jeongukjae 1
[Dataset Request] KorQuAD
Dataset Information

Dataset Name: KorQuAD 1.0

Preferred code name (e.g. korean_chatbot_qa_data): korquad_10

Dataset description: KorQuAD 1.0 is a dataset built for Korean Machine Reading Comprehension. The answer to every question is a span of the corresponding Wikipedia article paragraph. It is constructed in the same way as the Stanford Question Answering Dataset (SQuAD) v1.0.

Homepage: https://korquad.github.io/KorQuad%201.0/

Citation:

Dataset Information

Dataset Name: KorQuAD 2.0

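Preferred code name (e.g. korean_chatbot_qa_data): korquad_20

Dataset description: KorQuAD 2.0 is a Korean Machine Reading Comprehension dataset of 100,000+ question-answer pairs in total, including 20,000+ pairs from KorQuAD 1.0. Unlike KorQuAD 1.0, the answer must be found in the whole Wikipedia article rather than in one or two paragraphs. Because some documents are very long, search time needs to be taken into account. Tables and lists are also included, so understanding the document structure through HTML tags is required as well. This dataset should make machine reading comprehension possible on documents of varied forms and lengths.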

Homepage: https://korquad.github.io

Citation:

Additional Context

For now, adding only KorQuAD 1.0 and adding 2.0 later should be fine.
dataset request before-release
opened by jeongukjae 1
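[Dataset Request] Korea Maritime and Ocean University NER dataset
Dataset Information

Dataset Name: Korea Maritime and Ocean University Natural Language Processing Lab NER dataset

Preferred code name (e.g. korean_chatbot_qa_data): kmounlp_ner

Dataset description: A technical report on defining Korean named entities and standardizing their tags, and a named-entity morpheme corpus built on that report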

Homepage: https://github.com/kmounlp/NER

Citation:

Additional Context

Report: https://github.com/kmounlp/NER/blob/master/NER%20Guideline%20(ver%201.0).pdf
dataset request
opened by jeongukjae 1
Add CONTRIBUTING.md
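[ ] Explain which languages the project uses. Usage and dataset descriptions should be written in English where possible, but perhaps issue/PR discussion is better held in Korean?

[ ] How to add a dataset

[ ] A brief explanation of issues/PRs/Discussions

[ ] Information for people who would like to help maintain the project

[ ] An explanation of dataset licensing issues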

documentation before-release
opened by jeongukjae 1
Document the current wikitext issues in the catalog

https://github.com/jeongukjae/tfds-korean/issues/12#issuecomment-826358469

For the reason above, it seems best to at least note something like "filter before use" or "there are empty examples in between".
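
A minimal sketch of the "filter before use" advice, in plain Python: skip examples whose text is empty or whitespace-only. The drop_empty helper, the field name "text", and the sample examples are assumptions for illustration and are not taken from the catalog.

```python
from typing import Dict, Iterable, Iterator


def drop_empty(examples: Iterable[Dict[str, str]], key: str = "text") -> Iterator[Dict[str, str]]:
    """Yield only the examples whose text field contains non-whitespace characters."""
    for example in examples:
        if example[key].strip():
            yield example


# Hypothetical examples; the field name "text" is an assumption.
raw = [{"text": "= Title ="}, {"text": ""}, {"text": "   "}, {"text": "Body line."}]
kept = list(drop_empty(raw))
```

On a tf.data.Dataset the equivalent would likely be a Dataset.filter call with a predicate based on tf.strings.length, depending on the dataset's actual feature names.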
documentation

opened by jeongukjae 0
[Dataset Request] sci-news-sum-kr-50
Dataset Information

Dataset Name:

Preferred code name (e.g. korean_chatbot_qa_data): sci_news_sum_kr_50

Dataset description:

Homepage: https://github.com/theeluwin/sci-news-sum-kr-50

Citation:

Additional Context
dataset request
opened by jeongukjae 0
[Dataset Request] kowikitext
Dataset Information

Dataset Name: Korean wikitext

Preferred code name (e.g. korean_chatbot_qa_data): kowikitext

Dataset description: Wikitext format Korean corpus

Homepage: https://github.com/lovit/kowikitext

Citation:

Additional Context

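This dataset appears to have the same problem as #12, but for now it follows the Korpora approach. Here too, splitting on headings cannot exactly recover document boundaries, because lines such as "= 분류~~~ =" (분류 means "category") exist.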
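
The heading problem described above can be reproduced with a small sketch. It assumes a top-level heading is a line of the form "= something =" and uses a hypothetical five-line sample; the TOP_HEADING pattern and split_on_headings helper are illustrative and not taken from kowikitext or Korpora.

```python
import re
from typing import List

# Assumed pattern for a top-level heading: "= something =" alone on a line.
# The negative lookahead rejects deeper headings such as "= = Section = =".
TOP_HEADING = re.compile(r"^\s*=\s(?!=)(.+?)\s=\s*$")


def split_on_headings(lines: List[str]) -> List[List[str]]:
    """Start a new chunk at every line that looks like a top-level heading."""
    chunks: List[List[str]] = []
    for line in lines:
        if TOP_HEADING.match(line) or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return chunks


# Hypothetical sample: two documents, with a category-style line inside the first.
sample = [
    " = Document A = ",
    "Body of document A.",
    " = Category: Example = ",
    " = Document B = ",
    "Body of document B.",
]
chunks = split_on_headings(sample)
```

The sample contains two documents, but the split yields three chunks because the category-style line is indistinguishable from a document title under this pattern.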
dataset request
opened by jeongukjae 0
[Dataset Request] korean_unsmile_dataset
Dataset Information

Dataset Name:

Preferred code name (e.g. korean_chatbot_qa_data):

Dataset description:

Homepage: https://github.com/smilegate-ai/korean_unsmile_dataset

Citation:

License:

Additional Context
dataset request
opened by jeongukjae 0
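Allow the dataset catalog builder to skip specific datasets

Currently every dataset must exist locally before the catalog can be built, which is too much of a burden. On the develop branch alone, roughly 30GB has to be kept locally.

As long as the dataset versions do not change, the catalog only needs to be rebuilt when the build_catalog.py script changes, so let's make it possible to build just a specific dataset page and the index page. Building the catalog for all datasets should of course remain possible.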
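
One possible shape for such an option, sketched with argparse: a --datasets flag selects the pages to rebuild, the index page is always rebuilt, and omitting the flag keeps the full build. The flag name, ALL_DATASETS, and pages_to_build are hypothetical; nothing here reflects the actual build_catalog.py.

```python
import argparse
from typing import List, Optional, Sequence

# Hypothetical dataset names; the real script would discover them itself.
ALL_DATASETS = ["nsmc", "kowikitext", "namuwikitext"]


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Build dataset catalog pages.")
    parser.add_argument(
        "--datasets",
        nargs="*",
        default=None,
        help="Datasets to rebuild; omit the flag to rebuild every dataset.",
    )
    return parser.parse_args(argv)


def pages_to_build(selected: Optional[List[str]]) -> List[str]:
    """Return the pages to rebuild; the index page is always included."""
    names = ALL_DATASETS if selected is None else selected
    unknown = sorted(set(names) - set(ALL_DATASETS))
    if unknown:
        raise ValueError(f"unknown datasets: {unknown}")
    return ["index"] + list(names)


partial = pages_to_build(parse_args(["--datasets", "nsmc"]).datasets)
full = pages_to_build(parse_args([]).datasets)
```

With this layout only the selected datasets would need to be present locally, which is the point of the request.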
documentation

opened by jeongukjae 0
[Dataset Request] Korean Single Speaker Speech Dataset
Dataset Information

Dataset Name: Korean Single Speaker Speech Dataset

Preferred code name (e.g. korean_chatbot_qa_data):

Dataset description:

Homepage: https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset

Citation:

License:

Additional Context
dataset request
opened by jeongukjae 0
[Dataset Request] Sejong Corpus
Dataset Information

Dataset Name:

Preferred code name (e.g. korean_chatbot_qa_data): sejong_corpus

Dataset description:

Homepage: https://ithub.korean.go.kr/user/total/database/corpusManager.do

Citation:

License:

Additional Context

Sejong Corpus: https://ithub.korean.go.kr/user/total/database/corpusManager.do
Sejong Corpus (parallel): https://ithub.korean.go.kr/user/total/database/etcManager.do

Even if the license makes commercial use difficult, I think this is a good corpus to work with, so it seems worth adding for now.
dataset request
opened by jeongukjae 0
[Dataset Request] kcbert
Dataset Information

Dataset Name:

Preferred code name (e.g. korean_chatbot_qa_data): kcbert

Dataset description:

Homepage: https://github.com/Beomi/KcBERT

Citation:

Additional Context

Adding this one should turn out to be really useful!!
dataset request
opened by jeongukjae 4
[Dataset Request] KAIST Corpus
Dataset Information

Dataset Name: kaist corpus

Preferred code name (e.g. korean_chatbot_qa_data): kaist_corpus

Dataset description:

Homepage: http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus

Citation:

Additional Context
wontfix dataset request
opened by jeongukjae 1

Releases(0.4.0)

0.4.0(Sep 19, 2021)
Update KLUE dataset to 1.1.0 https://github.com/jeongukjae/tfds-korean/commit/e954ec4550ec5db015d3f93750e6763aca5a9b48

Reorder ClassLabel names of NLI datasets. https://github.com/jeongukjae/tfds-korean/commit/be3e8cba7b9d537969b9c08738dd6df36b0145bc

0.3.0(Jun 16, 2021)
add korean_wikipedia_corpus (https://jeongukjae.github.io/tfds-korean/datasets/korean_wikipedia_corpus.html)

add namuwiki_corpus (https://jeongukjae.github.io/tfds-korean/datasets/namuwiki_corpus.html)

0.2.0(Jun 6, 2021)
add KLUE benchmark datasets

update dataset catalog (https://github.com/jeongukjae/tfds-korean/commit/eb1c72d0a716aba7326276e77e8e6f94976bb579, https://github.com/jeongukjae/tfds-korean/commit/614616b82d0bbdaecbc4ec50e0cfc67b78b646c2)

fix klue_ner supervised key bug (https://github.com/jeongukjae/tfds-korean/commit/10f765f01b9f3952e298395779dcf8efeefde93a)

0.1.3(May 29, 2021)
add klue_ner

add korquad 1.0 & 2.1

0.1.2(May 25, 2021)
add sae4k dataset #18

fix #19

0.1.1(Apr 30, 2021)
add sci_news_sum_kr50

add petitions_archive

0.1.0(Apr 29, 2021)
Add kowikitext and namuwikitext dataset

Add missing licenses and bibtex.

Add license section in catalog page.

Add example links in catalog page.


Owner

Jeong Ukjae

Machine Learning Engineer

GitHub Repository https://jeongukjae.github.io/tfds-korean/

