Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Last update: Aug 26, 2022

Related tags

Overview

Indobenchmark Toolkit

Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources for Bahasa Indonesia such as Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, DeepMind, Gojek, and Prosa.AI.

Research Paper

IndoNLU has been accepted by AACL-IJCNLP 2020 and you can find the details in our paper https://www.aclweb.org/anthology/2020.aacl-main.85.pdf. If you are using any component on IndoNLU including Indo4B, FastText-Indo4B, or IndoBERT in your work, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

IndoNLG has been accepted by EMNLP 2021 and you can find the details in our paper https://arxiv.org/abs/2104.08200. If you are using any component on IndoNLG including Indo4B-Plus, IndoBART, or IndoGPT in your work, please cite the following paper:

@misc{cahyawijaya2021indonlg,
      title={IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation}, 
      author={Samuel Cahyawijaya and Genta Indra Winata and Bryan Wilie and Karissa Vincentio and Xiaohong Li and Adhiguna Kuncoro and Sebastian Ruder and Zhi Yuan Lim and Syafri Bahar and Masayu Leylia Khodra and Ayu Purwarianti and Pascale Fung},
      year={2021},
      eprint={2104.08200},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

IndoNLU and IndoNLG Models

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite Pretrained Language Model [Link]

IndoBERT-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-large
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-base
- Phase 1 [Link]
- Phase 2 [Link]
IndoBERT-lite-large
- Phase 1 [Link]
- Phase 2 [Link]

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding Vector file (3.9 GB)

FastText model (11.9 GB) [Link]
Vector file (3.9 GB) [Link]

We provide smaller FastText models with smaller vocabulary for each of the 12 downstream tasks

FastText-Indo4B [Link]
FastText-CC-ID [Link]

IndoBART and IndoGPT Models

We provide IndoBART and IndoGPT Pretrained Language Model [Link]

IndoBART [Link]
IndoBART-v2 [Link]
IndoGPT2 [Link]

You might also like...

An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

310 Feb 1, 2021

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics.

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics. Jury offers a smooth and easy-to-use interface. It uses datasets for underlying metric computation, and hence adding custom metric is easy as adopting datasets.Metric.

129 Jan 6, 2023

Code for the paper "Flexible Generation of Natural Language Deductions"

12 Nov 11, 2022

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

BanglaBERT This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced i

197 Dec 25, 2022

Releases(v0.1.4)

v0.1.4(Jun 22, 2022)
Fix spacing between subword when decoding using IndoNLGTokenizer

Remove unused additional special tokens '[java]', '[sunda]', '[indonesia]' from IndoNLGTokenizer (language tokens are included in the special_tokens_to_ids instead)

Source code(tar.gz)
Source code(zip)
indobenchmark-toolkit-0.1.4.tar.gz(13.62 KB)

Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Related tags

Overview

Indobenchmark Toolkit

Research Paper

IndoNLU and IndoNLG Models

IndoBERT and IndoBERT-lite Models

FastText (Indo4B)

IndoBART and IndoGPT Models

You might also like...

An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

Simple tool/toolkit for evaluating NLG (Natural Language Generation) offering various automated metrics.

Code for the paper "Flexible Generation of Natural Language Deductions"

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Combating Embedding Barrier in Multilingual Models for Low-Resource Language Understanding".

KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

A python framework to transform natural language questions to queries in a database query language.

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

NL. The natural language programming language.

Releases(v0.1.4)

v0.1.4(Jun 22, 2022)

Owner

Samuel Cahyawijaya

SurvTRACE: Transformers for Survival Analysis with Competing Events

Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

Outreachy TFX custom component project

Perform sentiment analysis on textual data that people generally post on websites like social networks and movie review sites.

AI and Machine Learning workflows on Anthos Bare Metal.

Transformers-regression - Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates

TTS is a library for advanced Text-to-Speech generation.

Linear programming solver for paper-reviewer matching and mind-matching

Pipeline for chemical image-to-text competition

Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

The implementation of Parameter Differentiation based Multilingual Neural Machine Translation

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

Revisiting Pre-trained Models for Chinese Natural Language Processing (Findings of EMNLP 2020)

Uncomplete archive of files from the European Nopsled Team

Image2pcl - Enter the metaverse with 2D image to 3D projections

硕士期间自学的NLP子任务，供学习参考

AudioCLIP Extending CLIP to Image, Text and Audio

🎐 a python library for doing approximate and phonetic matching of strings.

Multi Task Vision and Language