๐Ÿฆ… Pretrained BigBird Model for Korean (up to 4096 tokens)

Overview

Pretrained BigBird Model for Korean

What is BigBird โ€ข How to Use โ€ข Pretraining โ€ข Evaluation Result โ€ข Docs โ€ข Citation

ํ•œ๊ตญ์–ด | English

Apache 2.0 Issues linter DOI

What is BigBird?

BigBird: Transformers for Longer Sequences์—์„œ ์†Œ๊ฐœ๋œ sparse-attention ๊ธฐ๋ฐ˜์˜ ๋ชจ๋ธ๋กœ, ์ผ๋ฐ˜์ ์ธ BERT๋ณด๋‹ค ๋” ๊ธด sequence๋ฅผ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿฆ… Longer Sequence - ์ตœ๋Œ€ 512๊ฐœ์˜ token์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋Š” BERT์˜ 8๋ฐฐ์ธ ์ตœ๋Œ€ 4096๊ฐœ์˜ token์„ ๋‹ค๋ฃธ

โฑ๏ธ Computational Efficiency - Full attention์ด ์•„๋‹Œ Sparse Attention์„ ์ด์šฉํ•˜์—ฌ O(n2)์—์„œ O(n)์œผ๋กœ ๊ฐœ์„ 

How to Use

  • ๐Ÿค— Huggingface Hub์— ์—…๋กœ๋“œ๋œ ๋ชจ๋ธ์„ ๊ณง๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:)
  • ์ผ๋ถ€ ์ด์Šˆ๊ฐ€ ํ•ด๊ฒฐ๋œ transformers>=4.11.0 ์‚ฌ์šฉ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. (MRC ์ด์Šˆ ๊ด€๋ จ PR)
  • BigBirdTokenizer ๋Œ€์‹ ์— BertTokenizer ๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. (AutoTokenizer ์‚ฌ์šฉ์‹œ BertTokenizer๊ฐ€ ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค.)
  • ์ž์„ธํ•œ ์‚ฌ์šฉ๋ฒ•์€ BigBird Tranformers documentation์„ ์ฐธ๊ณ ํ•ด์ฃผ์„ธ์š”.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("monologg/kobigbird-bert-base")  # BigBirdModel
tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")  # BertTokenizer

Pretraining

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Pretraining BigBird] ์ฐธ๊ณ 

Hardware Max len LR Batch Train Step Warmup Step
KoBigBird-BERT-Base TPU v3-8 4096 1e-4 32 2M 20k
  • ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต
  • ITC (Internal Transformer Construction) ๋ชจ๋ธ๋กœ ํ•™์Šต (ITC vs ETC)

Evaluation Result

1. Short Sequence (<=512)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Short Sequence Dataset] ์ฐธ๊ณ 

NSMC
(acc)
KLUE-NLI
(acc)
KLUE-STS
(pearsonr)
Korquad 1.0
(em/f1)
KLUE MRC
(em/rouge-w)
KoELECTRA-Base-v3 91.13 86.87 93.14 85.66 / 93.94 59.54 / 65.64
KLUE-RoBERTa-Base 91.16 86.30 92.91 85.35 / 94.53 69.56 / 74.64
KoBigBird-BERT-Base 91.18 87.17 92.61 87.08 / 94.71 70.33 / 75.34

2. Long Sequence (>=1024)

์ž์„ธํ•œ ๋‚ด์šฉ์€ [Finetune on Long Sequence Dataset] ์ฐธ๊ณ 

TyDi QA
(em/f1)
Korquad 2.1
(em/f1)
Fake News
(f1)
Modu Sentiment
(f1-macro)
KLUE-RoBERTa-Base 76.80 / 78.58 55.44 / 73.02 95.20 42.61
KoBigBird-BERT-Base 79.13 / 81.30 67.77 / 82.03 98.85 45.42

Docs

Citation

KoBigBird๋ฅผ ์‚ฌ์šฉํ•˜์‹ ๋‹ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ์ธ์šฉํ•ด์ฃผ์„ธ์š”.

@software{jangwon_park_2021_5654154,
  author       = {Jangwon Park and Donggyu Kim},
  title        = {KoBigBird: Pretrained BigBird Model for Korean},
  month        = nov,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {1.0.0},
  doi          = {10.5281/zenodo.5654154},
  url          = {https://doi.org/10.5281/zenodo.5654154}
}

Contributors

Jangwon Park and Donggyu Kim

Acknowledgements

KoBigBird๋Š” Tensorflow Research Cloud (TFRC) ํ”„๋กœ๊ทธ๋žจ์˜ Cloud TPU ์ง€์›์œผ๋กœ ์ œ์ž‘๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ ๋ฉ‹์ง„ ๋กœ๊ณ ๋ฅผ ์ œ๊ณตํ•ด์ฃผ์‹  Seyun Ahn๋‹˜๊ป˜ ๊ฐ์‚ฌ๋ฅผ ์ „ํ•ฉ๋‹ˆ๋‹ค.

You might also like...
KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KoGPT KoGPT (Korean Generative Pre-trained Transformer) https://github.com/kakaobrain/kogpt https://huggingface.co/kakaobrain/kogpt Model Descriptions

Generating Korean Slogans with phonetic and structural repetition
Generating Korean Slogans with phonetic and structural repetition

LexPOS_ko Generating Korean Slogans with phonetic and structural repetition Generating Slogans with Linguistic Features LexPOS is a sequence-to-sequen

Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ
Korean extractive summarization. 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ

korean extractive summarization 2021 AI ํ…์ŠคํŠธ ์š”์•ฝ ์˜จ๋ผ์ธ ํ•ด์ปคํ†ค ํ™”์„ฑ๊ฐˆ๋„๋‹ˆ๊นŒํŒ€ ์ฝ”๋“œ Leaderboard Notice Text Summarization with Pretrained Encoders์— ๋‚˜์˜ค๋Š” bertsumext๋ชจ๋ธ(ext

Training code for Korean multi-class sentiment analysis

KoSentimentAnalysis Bert implementation for the Korean multi-class sentiment analysis ์™œ ํ•œ๊ตญ์–ด ๊ฐ์ • ๋‹ค์ค‘๋ถ„๋ฅ˜ ๋ชจ๋ธ์€ ๊ฑฐ์˜ ์—†๋Š” ๊ฒƒ์ผ๊นŒ?์—์„œ ์‹œ์ž‘๋œ ํ”„๋กœ์ ํŠธ Environment: Pytorch, Da

Korean Sentence Embedding Repository

Korean-Sentence-Embedding ๐Ÿญ Korean sentence embedding repository. You can download the pre-trained models and inference right away, also it provides

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TenforFlow/Keras.

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet ๐Ÿฆ ๐Ÿ‡ฎ๐Ÿ‡ฉ 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

Crie tokens de autenticaรงรฃo รญntegros e seguros com UToken.

UToken - Tokens seguros. UToken (ou Unhandleable Token) รฉ uma bilioteca criada para ser utilizada na geraรงรฃo de tokens seguros e รญntegros, ou seja, nรฃ

Comments
  • Pretraining Epoch ์งˆ๋ฌธ

    Pretraining Epoch ์งˆ๋ฌธ

    Checklist

    • [x] I've searched the project's issues

    โ“ Question

    ์•ˆ๋…•ํ•˜์„ธ์š” ์ €๋Š” ํ˜„์žฌ ์นœ๊ตฌ๋“ค๊ณผ ํ•จ๊ป˜ 4096 ํ† ํฐ์„ ์ž…๋ ฅ๋ฐ›์•„ ์š”์•ฝ ํƒœ์Šคํฌ๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒ˜์Œ์—” ๋น…๋ฒ„๋“œ + ๋ฒ„ํŠธ ์กฐํ•ฉ์œผ๋กœ ํ•ด๋ณด๋ ค๊ณ  ํ–ˆ๋Š”๋ฐ, ์ด๋ฏธ monologg ๋‹˜๊ป˜์„œ ๋งŒ๋“ค์–ด์ฃผ์…จ๋”๋ผ๊ตฌ์š” ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ๋กฑํฌ๋จธ + ๋ฐ”ํŠธ + ํŽ˜๊ฐ€์ˆ˜์Šค ์กฐํ•ฉ์œผ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. pretrained๋œ KoBart๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์–ดํ…์…˜์„ ๋กฑํฌ๋จธ๋กœ ๋ฐ”๊พผ ํ›„, ํŽ˜๊ฐ€์ˆ˜์Šค task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ตฌ์กฐ๋กœ ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

    ํ˜„์žฌ 13GB์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์•„์„œ ์ „์ฒ˜๋ฆฌ์™€ ๋ฐ์ดํ„ฐ๋กœ๋” ์ž‘์„ฑ, ๋ชจ๋ธ ์ฝ”๋“œ๊นŒ์ง€๋Š” ์™„๋ฃŒํ•œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค. ์ด๋ฒˆ ์ฃผ ๋‚ด๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ ค ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

    ์ €ํฌ๊ฐ€ ๊ฐ€์ง„ GPU๋กœ๋Š” ๋Œ€๋žต ์ดํ‹€์ด๋ฉด 1 ์—ํฌํฌ๋ฅผ ๋Œ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์€๋ฐ, monologg๋‹˜๊ป˜์„œ๋Š” KoBirBird ๋ชจ๋ธ ๊ฐœ๋ฐœ ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋„์…จ๋Š”์ง€ ์—ฌ์ญค๋ณด๊ณ  ์‹ถ์Šต๋‹ˆ๋‹ค.

    ์•„๋ฌด๋ž˜๋„ pretrained ๋œ ๋ชจ๋ธ์„ ๊ฐ€์ ธ๋‹ค ์“ฐ๋‹ค๋ณด๋‹ˆ ์—ํฌํฌ๋ฅผ ๋งŽ์ด ๋Œ ํ•„์š”๋Š” ์—†์„ ๊ฒƒ ๊ฐ™์€๋ฐ, ๊ธฐ์ค€์ ์œผ๋กœ ์‚ผ๊ณ  ์‹ถ์–ด์„œ์š”!

    ๋ง์ด ๊ธธ์–ด์กŒ๋Š”๋ฐ ์š”์•ฝํ•˜์ž๋ฉด, KoBirBird ํ•™์Šต ์‹œ ์—ํฌํฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ฃผ์…จ๋Š”์ง€ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ทธ ๊ธฐ์ค€์€ ๋ฌด์—‡์œผ๋กœ ์‚ผ์œผ์…จ๋Š”์ง€๋„ ๊ถ๊ธˆํ•ฉ๋‹ˆ๋‹ค.

    question 
    opened by KimJaehee0725 2
  • Specific information about this model.

    Specific information about this model.

    Checklist

    • [ x ] I've searched the project's issues

    โ“ Question

    • You mentioned "๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜, ํ•œ๊ตญ์–ด ์œ„ํ‚ค, Common Crawl, ๋‰ด์Šค ๋ฐ์ดํ„ฐ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต" and I want to know the size of total corpus for pre-training.

    • Also I want to know the vocab size of this model.

    ๐Ÿ“Ž Additional context

    question 
    opened by midannii 2
  • Fix some minors

    Fix some minors

    Description

    ์ฝ”๋“œ์™€ ์ฃผ์„ ๋“ฑ์„ ์ฝ๋‹ค๊ฐ€ ๋ณด์ธ ์ž‘์€ ์˜คํƒ€ ๋“ฑ์„ ์ˆ˜์ •ํ–ˆ์Šต๋‹ˆ๋‹ค

    ๋‹ค์–‘ํ•œ ๋…ธํ•˜์šฐ๋ฅผ ์•„๋‚Œ์—†์ด ๊ณต์œ ํ•ด์ฃผ์‹  @monologg , @donggyukimc ์—๊ฒŒ ๊ฐ์‚ฌ์˜ ๋ง์”€๋“œ๋ฆฝ๋‹ˆ๋‹ค.

    ์ดํ›„์—๋Š” GPU ํ™˜๊ฒฝ์—์„œ finetuning์„ ํ…Œ์ŠคํŠธํ•ด ๋ณผ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค ๊ณ ๋ง™์Šต๋‹ˆ๋‹ค.

    Related Issue

    chore 
    opened by sackoh 0
Releases(v1.0.0)
Concept Modeling: Topic Modeling on Images and Text

Concept is a technique that leverages CLIP and BERTopic-based techniques to perform Concept Modeling on images.

Maarten Grootendorst 120 Dec 27, 2022
[EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction

LM-Critic: Language Models for Unsupervised Grammatical Error Correction This repo provides the source code & data of our paper: LM-Critic: Language M

Michihiro Yasunaga 98 Nov 24, 2022
justCTF [*] 2020 challenges sources

justCTF [*] 2020 This repo contains sources for justCTF [*] 2020 challenges hosted by justCatTheFish. TLDR: Run a challenge with ./run.sh (requires Do

justCatTheFish 25 Dec 27, 2022
Yomichad - a Japanese pop-up dictionary that can display readings and English definitions of Japanese words

Yomichad is a Japanese pop-up dictionary that can display readings and English definitions of Japanese words, kanji, and optionally named entities. It is similar to yomichan, 10ten, and rikaikun in s

Jonas Belouadi 7 Nov 07, 2022
NeurIPS'21: Probabilistic Margins for Instance Reweighting in Adversarial Training (Pytorch implementation).

source code for NeurIPS21 paper robabilistic Margins for Instance Reweighting in Adversarial Training

9 Dec 20, 2022
Simple and efficient RevNet-Library with DeepSpeed support

RevLib Simple and efficient RevNet-Library with DeepSpeed support Features Half the constant memory usage and faster than RevNet libraries Less memory

Lucas Nestler 112 Dec 05, 2022
Arabic speech recognition, classification and text-to-speech.

klaam Arabic speech recognition, classification and text-to-speech using many advanced models like wave2vec and fastspeech2. This repository allows tr

ARBML 177 Dec 27, 2022
ASCEND Chinese-English code-switching dataset

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese-English code-switching corpus collected in Hong Kong.

CAiRE 11 Dec 09, 2022
Club chatbot

Chatbot Club chatbot Instructions to get the Chatterbot working Step 1. First make sure you are using a version of Python 3 or newer. To check your ve

5 Mar 07, 2022
ElasticBERT: A pre-trained model with multi-exit transformer architecture.

This repository contains finetuning code and checkpoints for ElasticBERT. Towards Efficient NLP: A Standard Evaluation and A Strong Baseli

fastNLP 48 Dec 14, 2022
Code for Findings at EMNLP 2021 paper: "Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning"

Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning This repo is for Findings at EMNLP 2021 paper: Learn Cont

INK Lab @ USC 6 Sep 02, 2022
Prompt tuning toolkit for GPT-2 and GPT-Neo

mkultra mkultra is a prompt tuning toolkit for GPT-2 and GPT-Neo. Prompt tuning injects a string of 20-100 special tokens into the context in order to

61 Jan 01, 2023
Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

Li-am K. 0 Oct 25, 2022
NLP made easy

GluonNLP: Your Choice of Deep Learning for NLP GluonNLP is a toolkit that helps you solve NLP problems. It provides easy-to-use tools that helps you l

Distributed (Deep) Machine Learning Community 2.5k Jan 04, 2023
This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text.

Text Summarizer This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text. Team Members This mini-project was

1 Nov 16, 2021
Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Knover Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out eff

606 Dec 28, 2022
A combination of autoregressors and autoencoders using XLNet for sentiment analysis

A combination of autoregressors and autoencoders using XLNet for sentiment analysis Abstract In this paper sentiment analysis has been performed in or

James Zaridis 2 Nov 20, 2021
C.J. Hutto 3.8k Dec 30, 2022
๐Ÿฆ… Pretrained BigBird Model for Korean (up to 4096 tokens)

Pretrained BigBird Model for Korean What is BigBird โ€ข How to Use โ€ข Pretraining โ€ข Evaluation Result โ€ข Docs โ€ข Citation ํ•œ๊ตญ์–ด | English What is BigBird? Bi

Jangwon Park 183 Dec 14, 2022
iBOT: Image BERT Pre-Training with Online Tokenizer

Image BERT Pre-Training with iBOT Official PyTorch implementation and pretrained models for paper iBOT: Image BERT Pre-Training with Online Tokenizer.

Bytedance Inc. 435 Jan 06, 2023