RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

Overview

version bert

RoNER

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, high-accuracy Python package providing Romanian NER.

RoNER handles text splitting, word-to-subword alignment, and it works with arbitrarily long text sequences on CPU or GPU.

Instalation & usage

Install with: pip install roner

Run with:

20} = {word['tag']}")">
import roner
ner = roner.NER()

input_texts = ["George merge cu trenul Cluj - Timișoara de ora 6:20.", 
               "Grecia are capitala la Atena."]

output_texts = ner(input_texts)

for output_text in output_texts:
  print(f"Original text: {output_text['text']}")
  for word in output_text['words']:
    print(f"{word['text']:>20} = {word['tag']}")

RoNEC input

RoNER accepts either strings or lists of strings as input. If you pass a single string, it will convert it to a list containing this string.

RoNEC output

RoNER outputs a list of dictionary objects corresponding to the given input list of strings. A dictionary entry consists of:

>, "input_ids": < >, "words": [{ "text": < >, "tag": < > "pos": < >, "multi_word_entity": < >, "span_after": < >, "start_char": < >, "end_char": < >, "token_ids": < >, "tag_ids": < > }] }">
{
  "text": <
             
              >,
             
  "input_ids": <
             
              >,
             
  "words": [{
      "text": <
             
              >,
             
      "tag": <
             
              >
             
      "pos": <
             
              >,
             
      "multi_word_entity": <
             
              >,
             
      "span_after": <>,
      "start_char": <
              
               >,
              
      "end_char": <
              
               >,
              
      "token_ids": <
              
               >,
              
      "tag_ids": <
              
               >
              
    }]
}

This information is sufficient to save word-to-subtoken alignment, to have access to the original text as well as having other usable info such as the start and end positions for each word.

To list entities, simply iterate over all the words in the dict, printing the word itself word['text'] and its label word['tag'].

RoNER properties and considerations

Constructor options

The NER constructor has the following properties:

  • model:str Override this if you want to use your own pretrained model. Specify either a HuggingFace model or a folder location. If you use a different tag set than RONECv2, you need to also override the bio2tag_list option. The default model is dumitrescustefan/bert-base-romanian-ner
  • use_gpu:bool Set to True if you want to use the GPU (much faster!). Default is enabled; if there is no GPU found, it falls back to CPU.
  • batch_size:int How many sequences to process in parallel. On an 11GB GPU you can use batch_size = 8. Default is 4. Larger values mean faster processing - increase until you get OOM errors.
  • window_size:int Model size. BERT uses by default 512. Change if you know what you're doing. RoNER uses this value to compute overlapping windows (will overlap last quarter of the window).
  • num_workers:int How many workers to use for feeding data to GPU/CPU. Default is 0, meaning use the main process for data loading. Safest option is to leave at 0 to avoid possible errors at forking on different OSes.
  • named_persons_only:bool Set to True to output only named persons labeled with the class PERSON. This parameter is further explained below.
  • verbose:bool Set to True to get processing info. Leave it at its default False value for peace and quiet.
  • bio2tag_list:list Default None, change only if you trained your own model with different ordering of the BIO2 tags.

Implicit tokenization of texts

Please note that RoNER uses Stanza to handle Romanian tokenization into words and part-of-speech tagging. On first run, it will download not only the NER transformer model, but also Stanza's Romanian data package.

'PERSON' class handling

An important aspect that requires clarification is the handling of the PERSON label. In RONECv2, persons are not only names of persons (proper nouns, aka George Mihailescu), but also any common noun that refers to a person, such as ea, fratele or doctorul. For applications that do not need to handle this scenario, please set the named_persons_only value to True in RoNER's constructor.

What this does is use the part of speech tagging provided by Stanza and only set as PERSONs proper nouns.

Multi-word entities

Sometimes, entities span multiple words. To handle this, RoNER has a special property named multi_word_entity, which, when True, means that the current entity is linked to the previous one. Single-word entities will have this property set to False, as will the first word of multi-word entities. This is necessary to distinguish between sequential multi-word entities.

Detokenization

One particular use-case for a NER is to perform text anonymization, which means to replace entities with their label. With this in mind, RoNER has a detokenization function, that, applied to the outputs, will recreate the original strings.

To perform the anonymization, iterate through all the words, and replace the word's text with its label as in word['text'] = word['tag']. Then, simply run anonymized_texts = ner.detokenize(outputs). This will preserve spaces, new-lines and other characters.

NER accuracy metrics

Finally, because we trained the model on a modified version of RONECv2 (we performed data augumentation on the sentences, used a different training scheme and other train/validation/test splits) we are unable to compare to the standard baseline of RONECv2 as part of the original test set is now included in our training data, but we have obtained, to our knowledge, SOTA results on Romanian. This repo is meant to be used in production, and not for comparisons to other models.

BibTeX entry and citation info

Please consider citing the following paper as a thank you to the authors of the RONEC, even if it describes v1 of the corpus and you are using a model trained on v2 by the same authors:

Dumitrescu, Stefan Daniel, and Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019).

or in .bibtex format:

@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}
Owner
Stefan Dumitrescu
Machine Learning, NLP
Stefan Dumitrescu
NVDA, the free and open source Screen Reader for Microsoft Windows

NVDA NVDA (NonVisual Desktop Access) is a free, open source screen reader for Microsoft Windows. It is developed by NV Access in collaboration with a

NV Access 1.6k Jan 07, 2023
Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

anlp21 Course materials for "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley) Syllabus: http://people.ischool.berkeley.edu/~dba

David Bamman 48 Dec 06, 2022
PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

Ryuichi Yamamoto 279 Dec 09, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
Awesome-NLP-Research (ANLP)

Awesome-NLP-Research (ANLP)

Language, Information, and Learning at Yale 72 Dec 19, 2022
Code for evaluating Japanese pretrained models provided by NTT Ltd.

japanese-dialog-transformers 日本語の説明文はこちら This repository provides the information necessary to evaluate the Japanese Transformer Encoder-decoder dialo

NTT Communication Science Laboratories 216 Dec 22, 2022
Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

Uyghur 11 Nov 17, 2022
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
PyTorch original implementation of Cross-lingual Language Model Pretraining.

XLM NEW: Added XLM-R model. PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes: Monolingual language model pretrain

Facebook Research 2.7k Dec 27, 2022
A demo for end-to-end English and Chinese text spotting using ABCNet.

ABCNet_Chinese A demo for end-to-end English and Chinese text spotting using ABCNet. This is an old model that was trained a long ago, which serves as

Yuliang Liu 45 Oct 04, 2022
Abhijith Neil Abraham 2 Nov 05, 2021
Pervasive Attention: 2D Convolutional Networks for Sequence-to-Sequence Prediction

This is a fork of Fairseq(-py) with implementations of the following models: Pervasive Attention - 2D Convolutional Neural Networks for Sequence-to-Se

Maha 490 Dec 15, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023
A combination of autoregressors and autoencoders using XLNet for sentiment analysis

A combination of autoregressors and autoencoders using XLNet for sentiment analysis Abstract In this paper sentiment analysis has been performed in or

James Zaridis 2 Nov 20, 2021
Fully featured implementation of Routing Transformer

Routing Transformer A fully featured implementation of Routing Transformer. The paper proposes using k-means to route similar queries / keys into the

Phil Wang 246 Jan 02, 2023
An easier way to build neural search on the cloud

An easier way to build neural search on the cloud Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g

Jina AI 17.1k Jan 09, 2023
Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS)

TOPSIS implementation in Python Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) CHING-LAI Hwang and Yoon introduced TOPSIS

Hamed Baziyad 8 Dec 10, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023
(ACL-IJCNLP 2021) Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models.

BERT Convolutions Code for the paper Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models. Contains expe

mlpc-ucsd 21 Jul 18, 2022
AI-powered literature discovery and review engine for medical/scientific papers

AI-powered literature discovery and review engine for medical/scientific papers paperai is an AI-powered literature discovery and review engine for me

NeuML 819 Dec 30, 2022