IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference).

Last update: Nov 30, 2022

Overview

IndoBERTweet 🐦 🇮🇩

1. Paper

Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), Dominican Republic (virtual).

2. About

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually trained Indonesian BERT model with additive domain-specific vocabulary.

In this paper, we show that initializing domain-specific vocabulary with average-pooling of BERT subword embeddings is more efficient than pretraining from scratch, and more effective than initializing based on word2vec projections.
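The average-pooling idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_pool_init`, the toy tokenizer, and the two-dimensional toy embeddings are all hypothetical and chosen only to make the arithmetic visible. The assumption is that a new domain-specific word is split into subwords by the existing BERT tokenizer, and its new embedding starts as the mean of those subword vectors.

```python
from typing import Callable, Dict, List

Vector = List[float]


def average_pool_init(
    new_word: str,
    subword_tokenize: Callable[[str], List[str]],
    embeddings: Dict[str, Vector],
) -> Vector:
    """Return the mean of the existing subword embeddings of `new_word`."""
    pieces = subword_tokenize(new_word)
    vectors = [embeddings[piece] for piece in pieces]
    dim = len(vectors[0])
    # Element-wise mean over all subword vectors.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy data (hypothetical): a two-entry embedding table and a tokenizer
# that splits "covid19" into two known subwords.
toy_embeddings = {"covid": [1.0, 0.0], "##19": [0.0, 1.0]}


def toy_tokenize(word: str) -> List[str]:
    return ["covid", "##19"] if word == "covid19" else [word]


new_vec = average_pool_init("covid19", toy_tokenize, toy_embeddings)
print(new_vec)
```

In a real setting the embedding table and tokenizer would come from the pretrained Indonesian BERT model, and the resulting vectors would be used only as a starting point before further pretraining on tweets.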

3. Pretraining Data

We crawl Indonesian tweets over a 1-year period using the official Twitter API, from December 2019 to December 2020, with 60 keywords covering 4 main topics: economy, health, education, and government. We obtain a total of 409M word tokens, two times larger than the training data used to pretrain IndoBERT. Due to Twitter policy, this pretraining data will not be released to the public.

4. How to use

Load model and tokenizer (tested with transformers==3.5.1)

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")
model = AutoModel.from_pretrained("indolem/indobertweet-base-uncased")

Preprocessing steps:

lower-casing all words
converting user mentions and URLs into @USER and HTTPURL, respectively
translating emoticons into text using the emoji package
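The steps above can be sketched as a small helper. This is an illustrative approximation, not the authors' preprocessing script: the function name `preprocess_tweet`, both regular expressions, and the order of the steps are assumptions. It uses `emoji.demojize` from the emoji package when that package is installed and skips the emoji step otherwise.

```python
import re

try:
    import emoji  # third-party package named in the preprocessing steps
except ImportError:  # keep the sketch runnable without it
    emoji = None

# Assumed patterns: a URL starts with http(s):// or www.,
# and a mention is "@" followed by word characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def preprocess_tweet(text: str) -> str:
    """Normalise a raw tweet before tokenization (illustrative only)."""
    if emoji is not None:
        # Turn emoji into their textual aliases.
        text = emoji.demojize(text)
    text = text.lower()
    # Placeholders are inserted after lower-casing so they stay upper-case.
    text = URL_RE.sub("HTTPURL", text)
    text = MENTION_RE.sub("@USER", text)
    return text


example = preprocess_tweet("Halo @Budi, cek https://t.co/AbC123")
print(example)
```

The output of `preprocess_tweet` would then be passed to the tokenizer loaded above.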

5. Results over 7 Indonesian Twitter Datasets

Models	Sentiment (IndoLEM)	Sentiment (SmSA)	Emotion (EmoT)	Hate Speech (HS1)	Hate Speech (HS2)	NER (Formal)	NER (Informal)	Average
mBERT	76.6	84.7	67.5	85.1	75.1	85.2	83.2	79.6
malayBERT	82.0	84.1	74.2	85.0	81.9	81.9	81.3	81.5
IndoBERT (Willie, et al., 2020)	84.1	88.7	73.3	86.8	80.4	86.3	84.3	83.4
IndoBERT (Koto, et al., 2020)	84.1	87.9	71.0	86.4	79.3	88.0	86.9	83.4
IndoBERTweet (1M steps from scratch)	86.2	90.4	76.0	88.8	87.5	88.1	85.4	86.1
IndoBERT + Voc adaptation + 200k steps	86.6	92.7	79.0	88.4	84.0	87.7	86.9	86.5
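The values in the Average column are consistent with an unweighted mean of the seven dataset scores, rounded to one decimal place. The short check below recomputes it from the table rows. The dictionary name `scores` and the helper `row_average` are illustrative only.

```python
# Per-dataset scores copied from the table, in column order:
# IndoLEM, SmSA, EmoT, HS1, HS2, Formal, Informal.
scores = {
    "mBERT": [76.6, 84.7, 67.5, 85.1, 75.1, 85.2, 83.2],
    "malayBERT": [82.0, 84.1, 74.2, 85.0, 81.9, 81.9, 81.3],
    "IndoBERT (Willie, et al., 2020)": [84.1, 88.7, 73.3, 86.8, 80.4, 86.3, 84.3],
    "IndoBERT (Koto, et al., 2020)": [84.1, 87.9, 71.0, 86.4, 79.3, 88.0, 86.9],
    "IndoBERTweet (1M steps from scratch)": [86.2, 90.4, 76.0, 88.8, 87.5, 88.1, 85.4],
    "IndoBERT + Voc adaptation + 200k steps": [86.6, 92.7, 79.0, 88.4, 84.0, 87.7, 86.9],
}


def row_average(values):
    """Unweighted mean of one table row, rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


averages = {model: row_average(vals) for model, vals in scores.items()}
for model, avg in averages.items():
    print(f"{model}: {avg}")
```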
