PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

Last update: Dec 21, 2022

Overview

pororo performs Natural Language Processing and Speech-related tasks.

Various subtasks in natural language and speech processing can be solved simply by passing the task name.

Installation

pororo is based on torch 1.6 (CUDA 10.1) and python>=3.6.
You can install the package with the command below:

pip install pororo

Or you can install it locally:

git clone https://github.com/kakaobrain/pororo.git
cd pororo
pip install -e .

To install libraries for specific tasks beyond the common modules, please refer to INSTALL.md.
To use Automatic Speech Recognition, wav2letter must be installed separately. To install it, run the asr-install.sh file:

bash asr-install.sh

Usage

pororo can be used as follows.
First, import pororo with the following snippet:

>>> from pororo import Pororo

After the import, you can check the tasks currently supported by pororo with the following command:

>>> from pororo import Pororo
>>> Pororo.available_tasks()
"Available tasks are ['mrc', 'rc', 'qa', 'question_answering', 'machine_reading_comprehension', 'reading_comprehension', 'sentiment', 'sentiment_analysis', 'nli', 'natural_language_inference', 'inference', 'fill', 'fill_in_blank', 'fib', 'para', 'pi', 'cse', 'contextual_subword_embedding', 'similarity', 'sts', 'semantic_textual_similarity', 'sentence_similarity', 'sentvec', 'sentence_embedding', 'sentence_vector', 'se', 'inflection', 'morphological_inflection', 'g2p', 'grapheme_to_phoneme', 'grapheme_to_phoneme_conversion', 'w2v', 'wordvec', 'word2vec', 'word_vector', 'word_embedding', 'tokenize', 'tokenise', 'tokenization', 'tokenisation', 'tok', 'segmentation', 'seg', 'mt', 'machine_translation', 'translation', 'pos', 'tag', 'pos_tagging', 'tagging', 'const', 'constituency', 'constituency_parsing', 'cp', 'pg', 'collocation', 'collocate', 'col', 'word_translation', 'wt', 'summarization', 'summarisation', 'text_summarization', 'text_summarisation', 'summary', 'gec', 'review', 'review_scoring', 'lemmatization', 'lemmatisation', 'lemma', 'ner', 'named_entity_recognition', 'entity_recognition', 'zero-topic', 'dp', 'dep_parse', 'caption', 'captioning', 'asr', 'speech_recognition', 'st', 'speech_translation', 'ocr', 'srl', 'semantic_role_labeling', 'p2g', 'aes', 'essay', 'qg', 'question_generation', 'age_suitability']"

To check which models are supported for each task, call available_models with the task name:

>>> from pororo import Pororo
>>> Pororo.available_models("collocation")
'Available models for collocation are ([lang]: ko, [model]: kollocate), ([lang]: en, [model]: collocate.en), ([lang]: ja, [model]: collocate.ja), ([lang]: zh, [model]: collocate.zh)'

To perform a specific task, pass the task name in the task argument and the language name in the lang argument:

>>> from pororo import Pororo
>>> ner = Pororo(task="ner", lang="en")

Once the object is constructed, call it with the input value as follows:

>>> ner("Michael Jeffrey Jordan (born February 17, 1963) is an American businessman and former professional basketball player.")
[('Michael Jeffrey Jordan', 'PERSON'), ('(', 'O'), ('born', 'O'), ('February 17, 1963)', 'DATE'), ('is', 'O'), ('an', 'O'), ('American', 'NORP'), ('businessman', 'O'), ('and', 'O'), ('former', 'O'), ('professional', 'O'), ('basketball', 'O'), ('player', 'O'), ('.', 'O')]
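The result is a list of (text, tag) pairs in which 'O' marks text outside any entity. As a minimal sketch of post-processing that output, the helper below keeps only the tagged entities. The function name extract_entities is an assumption made for illustration and is not part of pororo; the sample list is shortened from the output above.

```python
from typing import List, Tuple


def extract_entities(
    tagged: List[Tuple[str, str]], outside_tag: str = "O"
) -> List[Tuple[str, str]]:
    """Keep only the (text, tag) pairs whose tag is not the 'outside' tag."""
    return [(text, tag) for text, tag in tagged if tag != outside_tag]


# First part of the English NER output shown above.
sample = [
    ("Michael Jeffrey Jordan", "PERSON"),
    ("(", "O"),
    ("born", "O"),
    ("February 17, 1963)", "DATE"),
    ("is", "O"),
    ("an", "O"),
    ("American", "NORP"),
    ("businessman", "O"),
]

entities = extract_entities(sample)
print(entities)
```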

If the task supports multiple languages, you can change the lang argument to use models trained on other languages.

>>> ner = Pororo(task="ner", lang="ko")
>>> ner("마이클 제프리 조던(영어: Michael Jeffrey Jordan, 1963년 2월 17일 ~ )은 미국의 은퇴한 농구 선수이다.")
[('마이클 제프리 조던', 'PERSON'), ('(', 'O'), ('영어', 'CIVILIZATION'), (':', 'O'), (' ', 'O'), ('Michael Jeffrey Jordan', 'PERSON'), (',', 'O'), (' ', 'O'), ('1963년 2월 17일 ~', 'DATE'), (' ', 'O'), (')은', 'O'), (' ', 'O'), ('미국', 'LOCATION'), ('의', 'O'), (' ', 'O'), ('은퇴한', 'O'), (' ', 'O'), ('농구 선수', 'CIVILIZATION'), ('이다.', 'O')]
>>> ner = Pororo(task="ner", lang="ja")
>>> ner("マイケル・ジェフリー・ジョーダンは、アメリカ合衆国の元バスケットボール選手")
[('マイケル・ジェフリー・ジョーダン', 'PERSON'), ('は', 'O'), ('、アメリカ合衆国', 'O'), ('の', 'O'), ('元', 'O'), ('バスケットボール', 'O'), ('選手', 'O')]
>>> ner = Pororo(task="ner", lang="zh")
>>> ner("麥可·傑佛瑞·喬丹是美國退役NBA職業籃球運動員，也是一名商人，現任夏洛特黃蜂董事長及主要股東")
[('麥可·傑佛瑞·喬丹', 'PERSON'), ('是', 'O'), ('美國', 'GPE'), ('退', 'O'), ('役', 'O'), ('nba', 'ORG'), ('職', 'O'), ('業', 'O'), ('籃', 'O'), ('球', 'O'), ('運', 'O'), ('動', 'O'), ('員', 'O'), ('，', 'O'), ('也', 'O'), ('是', 'O'), ('一', 'O'), ('名', 'O'), ('商', 'O'), ('人', 'O'), ('，', 'O'), ('現', 'O'), ('任', 'O'), ('夏洛特黃蜂', 'ORG'), ('董', 'O'), ('事', 'O'), ('長', 'O'), ('及', 'O'), ('主', 'O'), ('要', 'O'), ('股', 'O'), ('東', 'O')]

If the task supports multiple models, you can change the model argument to use another model.

>>> from pororo import Pororo
>>> mt = Pororo(task="mt", lang="multi", model="transformer.large.multi.mtpg")
>>> fast_mt = Pororo(task="mt", lang="multi", model="transformer.large.multi.fast.mtpg")

Documentation

For more detailed information, see the full documentation.

If you have any questions or requests, please open an issue.

Citation

If you use this library in a project or in research, please cite our code:

@misc{pororo,
  author       = {Heo, Hoon and Ko, Hyunwoong and Kim, Soohwan and
                  Han, Gunsoo and Park, Jiwoo and Park, Kyubyong},
  title        = {PORORO: Platform Of neuRal mOdels for natuRal language prOcessing},
  howpublished = {\url{https://github.com/kakaobrain/pororo}},
  year         = {2021},
}

Contributors

Hoon Heo, Hyunwoong Ko, Soohwan Kim, Gunsoo Han, Jiwoo Park and Kyubyong Park

License

The PORORO project is licensed under the terms of the Apache License 2.0.

Comments

Fix typo on para_gen docstrings and html
Title

fix typo on para_gen docstrings and html

Description

Englosh to English

Linked Issues

resolved #43

I should have submitted this in the same PR as the MRC fix. Sorry for the extra trouble.
opened by SDSTony 1
Fix typo on machine_reading_comprehension.py and mrc.html
Title

Fix typo on machine_reading_comprehension.py and mrc.html

Description

Fix typo comprehesion to comprehension found on

machine_reading_comprehension.py docstring

mrc.html

Linked Issues

resolved #41
opened by SDSTony 1
Fix typo on age_suitability.html
fix typo from nudiy to nudity

Title

fix typo on age_suitability.html

Description

There is a typo on the age_suitability.html page. I think the word Nudiy should be corrected to Nudity. I've edited the html file directly in this PR. If this isn't the proper way to edit a published web document, please close this PR. Thank you.

Linked Issues

#39
opened by SDSTony 1

Improve MRC inference and change output

Title

Improve MRC inference and change output

Summary

Predict span using the top 10 start and end positions
Add score output
Add logit output

Description

When predicting the answer span in MRC, the existing code used only the single highest-scoring start position and end position. For more accurate inference, the top 10 start positions and the top 10 end positions are now used to find the span with the highest score, where the score is defined as the sum of the start logit and the end logit. Finally, I added the logits and the score to the output for user convenience.
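The span search described here can be sketched in plain Python. This is an illustration under stated assumptions, not the code from this PR: the function name best_span, the top_k parameter, and the rule that the end position must not precede the start position are all assumptions. Only the top-10 candidates and the score as the sum of start and end logits come from the description.

```python
from typing import List, Optional, Tuple


def best_span(
    start_logits: List[float], end_logits: List[float], top_k: int = 10
) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, score) for the best span among top-k candidates."""

    def top_indices(logits: List[float]) -> List[int]:
        # Indices of the k largest logits, highest first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:top_k]

    best: Optional[Tuple[int, int, float]] = None
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            if end < start:
                # Assumption: a span whose end precedes its start is invalid.
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Taking each argmax independently would give start=3, end=2, which is not a
# valid span; searching the top candidates jointly finds a valid one.
start_logits = [0.1, 2.0, 0.3, 5.0]
end_logits = [0.2, 0.1, 4.0, 0.5]
print(best_span(start_logits, end_logits))
```

The toy logits show why the joint search matters: the individually best start and end positions can be incompatible, while a lower-ranked start can pair with the best end to form the highest-scoring valid span.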

Examples

>>> mrc = Pororo(task="mrc", lang="ko")
>>> mrc(
...    "카카오브레인이 공개한 것은?",
...    "카카오 인공지능(AI) 연구개발 자회사 카카오브레인이 AI 솔루션을 첫 상품화했다. 카카오는 카카오브레인 '포즈(pose·자세분석) API'를 유료 공개한다고 24일 밝혔다. 카카오브레인이 AI 기술을 유료 API를 공개하는 것은 처음이다. 공개하자마자 외부 문의가 쇄도한다. 포즈는 AI 비전(VISION, 영상·화면분석) 분야 중 하나다. 카카오브레인 포즈 API는 이미지나 영상을 분석해 사람 자세를 추출하는 기능을 제공한다."
... )
('포즈(pose·자세분석) API',
 (33, 44),
 (5.7833147048950195, 4.649877548217773),
 10.433192253112793)
>>> # if mecab-based postprocessing does not work well, set the `postprocess` option to `False`
>>> mrc("카카오브레인이 공개한 라이브러리 이름은?", "카카오브레인은 자연어 처리와 음성 관련 태스크를 쉽게 수행할 수 있도록 도와 주는 라이브러리 pororo를 공개하였습니다.", postprocess=False)
('pororo', (31, 35), (8.656489372253418, 8.14583683013916), 16.802326202392578)
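Since the new output is a four-element tuple, one option for readability is to wrap it in a named tuple. The sketch below is hypothetical: the class name MrcAnswer and its field names are assumptions and not part of pororo. Reading the second element as the answer position and the third as the (start, end) logits is an interpretation of the examples above. The tuple is the second example output.

```python
from typing import NamedTuple, Tuple


class MrcAnswer(NamedTuple):
    """Hypothetical wrapper for the 4-tuple returned in the examples above."""

    text: str
    span: Tuple[int, int]
    logits: Tuple[float, float]
    score: float


# Output of the second example, copied verbatim.
raw = ("pororo", (31, 35), (8.656489372253418, 8.14583683013916), 16.802326202392578)
answer = MrcAnswer(*raw)

print(answer.text)
print(answer.score)
```

In this example the score matches the sum of the two logits, which is consistent with the definition given in the description.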

opened by skaurl 0

Fixed Code Quality Issues
Title

Fixed Code Quality Issues

Description

Summary:

Remove unnecessary generator

Remove methods with an unnecessary super delegation

Remove redundant None

Add .deepsource.toml

I ran a DeepSource Analysis on my fork of this repository. You can see all the issues raised by DeepSource here.

DeepSource helps you to automatically find and fix issues in your code during code reviews. This tool looks for anti-patterns, bug risks, performance problems, and raises issues. There are plenty of other issues in relation to Bug Discovery and Anti-Patterns which you would be interested to take a look at.

If you do not want to use DeepSource to continuously analyze this repo, I'll remove the .deepsource.toml from this PR and you can merge the rest of the fixes. If you want to setup DeepSource for Continuous Analysis, I can help you set that up.
opened by HarshCasper 0
Update TTS example comment
Title

Update TTS example comment

Description

Update TTS example comment (Cross-lingual Voice Style Transfer => Code-Switching)

Linked Issues

resolved #00
opened by sooftware 0
Delete unused files & Add tts example ipynb
Title

Delete unused files & Add tts example ipynb

Description

Delete unused files (examples/.ipynb/, examples/Untitle.ipynb)

Add examples/speech_synthesis.ipynb

Linked Issues

resolved #00
opened by sooftware 0
Update TTS
Title

Denote TTS INSTALL.md & 3rd_party_model & Add tts-install.sh

Description

Denote TTS install requirements

Denote 3rd_party_model (TTS)

Add tts-install.sh

Test complete

docstring example update

Linked Issues

resolved #00
opened by sooftware 0
Mount TTS
Title

Mount TTS

Description

Mount TTS (Text-To-Speech) Task

Update LICENSE.3rd_party_library

Add test file (tts)

demo page (Not yet completed)

Linked Issues

resolved #00
opened by sooftware 0
Feature/6 kwargs
Title

Add kwargs to __call__ and predict

Description

Add kwargs to __call__ and predict so that unnecessary custom predict functions do not have to be written
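The idea can be illustrated with a small, self-contained sketch. The classes TaskBase and UppercaseTask and the suffix option are hypothetical stand-ins invented for this example; they are not pororo's actual classes or options. The point is only that __call__ forwards arbitrary keyword arguments to predict, so a task that needs an extra option does not need its own wrapper.

```python
class TaskBase:
    """Hypothetical base class standing in for a task wrapper."""

    def predict(self, text: str, **kwargs):
        raise NotImplementedError

    def __call__(self, text: str, **kwargs):
        # Forward every keyword option unchanged to predict().
        return self.predict(text, **kwargs)


class UppercaseTask(TaskBase):
    """Toy task that reads one optional keyword argument."""

    def predict(self, text: str, **kwargs):
        suffix = kwargs.get("suffix", "")  # hypothetical option
        return text.upper() + suffix


task = UppercaseTask()
print(task("pororo"))
print(task("pororo", suffix="!"))
```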

Linked Issues

resolved #6
opened by Huffon 0

fix: prevent OSError: read-only file system error

Description

I found that an OSError can occur when we try to load models into a temporary directory in a strictly managed environment, such as some containers on the cloud.

[2022-03-23 04:07:37,080] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     review_scoring_model = Pororo(task="review", lang="ko")
[2022-03-23 04:07:37,080] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/site-packages/pororo/pororo.py", line 203, in __new__
[2022-03-23 04:07:37,080] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     task_module = SUPPORTED_TASKS[task](
[2022-03-23 04:07:37,080] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/site-packages/pororo/tasks/review_scoring.py", line 86, in load
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     model = (BrainRobertaModel.load_model(
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/site-packages/pororo/models/brainbert/BrainRoBERTa.py", line 33, in load_model
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     ckpt_dir = download_or_load(model_name, lang)
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/site-packages/pororo/tasks/utils/download_utils.py", line 318, in download_or_load
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     return download_or_load_bert(info)
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/site-packages/pororo/tasks/utils/download_utils.py", line 104, in download_or_load_bert
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     type_dir = download_from_url(
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/site-packages/pororo/tasks/utils/download_utils.py", line 288, in download_from_url
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     wget.download(url, type_dir)
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/site-packages/wget.py", line 506, in download
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     (fd, tmpfile) = tempfile.mkstemp(".tmp", prefix=prefix, dir=".")
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/tempfile.py", line 331, in mkstemp
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     return _mkstemp_inner(dir, prefix, suffix, flags, output_type)
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]   File "/usr/local/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
[2022-03-23 04:07:37,081] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000]     fd = _os.open(file, flags, 0o600)
[2022-03-23 04:07:37,082] {ecs.py:362} INFO - [2022-03-23T04:07:12.901000] OSError: [Errno 30] Read-only file system: './brainbert.base.ko.review_rating.zip4zkvg88b.tmp'

This commit prevents that from happening. The code for the new function 'download' originates from the wget library written by anatoly techtonik, with slight revisions by me.
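The traceback shows the temporary file being created in the current working directory (dir="."), which fails when that directory is read-only. One plausible way to avoid this, sketched below, is to stage the temporary file inside the destination directory instead. This is an illustration built on the standard library only, not the code from this PR; the signature download(url, out_dir) is an assumption.

```python
import os
import shutil
import tempfile
import urllib.parse
import urllib.request


def download(url: str, out_dir: str) -> str:
    """Download url into out_dir, staging the temp file in out_dir as well."""
    os.makedirs(out_dir, exist_ok=True)
    filename = os.path.basename(urllib.parse.urlparse(url).path) or "download"

    # Create the temp file in the (writable) destination directory rather
    # than in the current working directory.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", prefix=filename, dir=out_dir)
    try:
        with os.fdopen(fd, "wb") as tmp_file, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp_file)
        final_path = os.path.join(out_dir, filename)
        # Both paths are in the same directory, so this is a plain rename.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave a partial temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

A call such as download(url, "/path/to/writable/cache") then only requires write access to the cache directory; the path here is a placeholder.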

opened by daun-io 0

Releases(0.4.0)

0.4.0(Feb 12, 2021)
Fix CPU-only machine error for Text Summarization (#11)

Apply kwargs to every predict func (#13)

Add Word Sense Disambiguation for Korean

Add apply_wsd args for Korean Named Entity Recognition

Add Speech Synthesis module (#27)

Fix show_probs KeyError for Japanese Sentiment Analysis (#28)

0.3.2(Feb 3, 2021)
Bug fixes:

change typing.OrderedDict to collections.OrderedDict

install dataclasses if python version is lower than 3.7
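For context, typing.OrderedDict only exists from Python 3.7.2 onward, while collections.OrderedDict is available on every Python 3 version, and dataclasses joined the standard library in Python 3.7. The sketch below illustrates both points. The requirement string is a standard environment marker and only an assumption about how such a conditional dependency could be declared; it is not taken from this release. The dictionary contents reuse model names from the collocation example.

```python
import sys
from collections import OrderedDict

# collections.OrderedDict works on Python 3.6, unlike typing.OrderedDict.
models = OrderedDict()
models["ko"] = "kollocate"
models["en"] = "collocate.en"
print(list(models.items()))

# Hypothetical requirement line: install the dataclasses backport only on
# interpreters older than 3.7.
DATACLASSES_REQUIREMENT = 'dataclasses; python_version < "3.7"'

# The same condition checked at runtime.
needs_backport = sys.version_info < (3, 7)
print(needs_backport)
```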

0.3.1(Feb 2, 2021)
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

pororo performs Natural Language Processing and Speech-related tasks.

It is easy to solve various subtasks in the natural language and speech processing field by simply passing the task name.

Supported Tasks

You can see more information here !

TEXT CLASSIFICATION

Automated Essay Scoring

Age Suitability Prediction

Natural Language Inference

Paraphrase Identification

Review Scoring

Semantic Textual Similarity

Sentence Embedding

Sentiment Analysis

Zero-shot Topic Classification

SEQUENCE TAGGING

Contextualized Embedding

Dependency Parsing

Fill-in-the-blank

Machine Reading Comprehension

Named Entity Recognition

Part-of-Speech Tagging

Semantic Role Labeling

SEQ2SEQ

Constituency Parsing

Grammatical Error Correction

Grapheme-to-Phoneme

Phoneme-to-Grapheme

Machine Translation

Paraphrase Generation

Question Generation

Text Summarization

MISC.

Automatic Speech Recognition

Image Captioning

Collocation

Lemmatization

Morphological Inflection

Optical Character Recognition

Tokenization

Word Translation


Owner

Kakao Brain

Kakao Brain Corp.

GitHub Repository https://kakaobrain.github.io/pororo
