RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

Overview

version bert

RoNER

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, high-accuracy Python package providing Romanian NER.

RoNER handles text splitting, word-to-subword alignment, and it works with arbitrarily long text sequences on CPU or GPU.

Instalation & usage

Install with: pip install roner

Run with:

20} = {word['tag']}")">
import roner
ner = roner.NER()

input_texts = ["George merge cu trenul Cluj - Timișoara de ora 6:20.", 
               "Grecia are capitala la Atena."]

output_texts = ner(input_texts)

for output_text in output_texts:
  print(f"Original text: {output_text['text']}")
  for word in output_text['words']:
    print(f"{word['text']:>20} = {word['tag']}")

RoNEC input

RoNER accepts either strings or lists of strings as input. If you pass a single string, it will convert it to a list containing this string.

RoNEC output

RoNER outputs a list of dictionary objects corresponding to the given input list of strings. A dictionary entry consists of:

>, "input_ids": < >, "words": [{ "text": < >, "tag": < > "pos": < >, "multi_word_entity": < >, "span_after": < >, "start_char": < >, "end_char": < >, "token_ids": < >, "tag_ids": < > }] }">
{
  "text": <
             
              >,
             
  "input_ids": <
             
              >,
             
  "words": [{
      "text": <
             
              >,
             
      "tag": <
             
              >
             
      "pos": <
             
              >,
             
      "multi_word_entity": <
             
              >,
             
      "span_after": <>,
      "start_char": <
              
               >,
              
      "end_char": <
              
               >,
              
      "token_ids": <
              
               >,
              
      "tag_ids": <
              
               >
              
    }]
}

This information is sufficient to save word-to-subtoken alignment, to have access to the original text as well as having other usable info such as the start and end positions for each word.

To list entities, simply iterate over all the words in the dict, printing the word itself word['text'] and its label word['tag'].

RoNER properties and considerations

Constructor options

The NER constructor has the following properties:

  • model:str Override this if you want to use your own pretrained model. Specify either a HuggingFace model or a folder location. If you use a different tag set than RONECv2, you need to also override the bio2tag_list option. The default model is dumitrescustefan/bert-base-romanian-ner
  • use_gpu:bool Set to True if you want to use the GPU (much faster!). Default is enabled; if there is no GPU found, it falls back to CPU.
  • batch_size:int How many sequences to process in parallel. On an 11GB GPU you can use batch_size = 8. Default is 4. Larger values mean faster processing - increase until you get OOM errors.
  • window_size:int Model size. BERT uses by default 512. Change if you know what you're doing. RoNER uses this value to compute overlapping windows (will overlap last quarter of the window).
  • num_workers:int How many workers to use for feeding data to GPU/CPU. Default is 0, meaning use the main process for data loading. Safest option is to leave at 0 to avoid possible errors at forking on different OSes.
  • named_persons_only:bool Set to True to output only named persons labeled with the class PERSON. This parameter is further explained below.
  • verbose:bool Set to True to get processing info. Leave it at its default False value for peace and quiet.
  • bio2tag_list:list Default None, change only if you trained your own model with different ordering of the BIO2 tags.

Implicit tokenization of texts

Please note that RoNER uses Stanza to handle Romanian tokenization into words and part-of-speech tagging. On first run, it will download not only the NER transformer model, but also Stanza's Romanian data package.

'PERSON' class handling

An important aspect that requires clarification is the handling of the PERSON label. In RONECv2, persons are not only names of persons (proper nouns, aka George Mihailescu), but also any common noun that refers to a person, such as ea, fratele or doctorul. For applications that do not need to handle this scenario, please set the named_persons_only value to True in RoNER's constructor.

What this does is use the part of speech tagging provided by Stanza and only set as PERSONs proper nouns.

Multi-word entities

Sometimes, entities span multiple words. To handle this, RoNER has a special property named multi_word_entity, which, when True, means that the current entity is linked to the previous one. Single-word entities will have this property set to False, as will the first word of multi-word entities. This is necessary to distinguish between sequential multi-word entities.

Detokenization

One particular use-case for a NER is to perform text anonymization, which means to replace entities with their label. With this in mind, RoNER has a detokenization function, that, applied to the outputs, will recreate the original strings.

To perform the anonymization, iterate through all the words, and replace the word's text with its label as in word['text'] = word['tag']. Then, simply run anonymized_texts = ner.detokenize(outputs). This will preserve spaces, new-lines and other characters.

NER accuracy metrics

Finally, because we trained the model on a modified version of RONECv2 (we performed data augumentation on the sentences, used a different training scheme and other train/validation/test splits) we are unable to compare to the standard baseline of RONECv2 as part of the original test set is now included in our training data, but we have obtained, to our knowledge, SOTA results on Romanian. This repo is meant to be used in production, and not for comparisons to other models.

BibTeX entry and citation info

Please consider citing the following paper as a thank you to the authors of the RONEC, even if it describes v1 of the corpus and you are using a model trained on v2 by the same authors:

Dumitrescu, Stefan Daniel, and Andrei-Marius Avram. "Introducing RONEC--the Romanian Named Entity Corpus." arXiv preprint arXiv:1909.01247 (2019).

or in .bibtex format:

@article{dumitrescu2019introducing,
  title={Introducing RONEC--the Romanian Named Entity Corpus},
  author={Dumitrescu, Stefan Daniel and Avram, Andrei-Marius},
  journal={arXiv preprint arXiv:1909.01247},
  year={2019}
}
Owner
Stefan Dumitrescu
Machine Learning, NLP
Stefan Dumitrescu
Code for lyric-section-to-comment generation based on huggingface transformers.

CommentGeneration Code for lyric-section-to-comment generation based on huggingface transformers. Migrate Guyu model and code (both 12-layers and 24-l

Yawei Sun 8 Sep 04, 2021
A PyTorch-based model pruning toolkit for pre-trained language models

English | 中文说明 TextPruner是一个为预训练语言模型设计的模型裁剪工具包,通过轻量、快速的裁剪方法对模型进行结构化剪枝,从而实现压缩模型体积、提升模型速度。 其他相关资源: 知识蒸馏工具TextBrewer:https://github.com/airaria/TextBrewe

Ziqing Yang 231 Jan 08, 2023
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

Chinese real time voice cloning (VC) and Chinese text to speech (TTS). 好用的中文语音克隆兼中文语音合成系统,包含语音编码器、语音合成器、声码器和可视化模块。

Kuang Dada 6 Nov 08, 2022
A natural language modeling framework based on PyTorch

Overview PyText is a deep-learning based NLP modeling framework built on PyTorch. PyText addresses the often-conflicting requirements of enabling rapi

Facebook Research 6.4k Dec 27, 2022
Maha is a text processing library specially developed to deal with Arabic text.

An Arabic text processing library intended for use in NLP applications Maha is a text processing library specially developed to deal with Arabic text.

Mohammad Al-Fetyani 184 Nov 27, 2022
Perform sentiment analysis and keyword extraction on Craigslist listings

craiglist-helper synopsis Perform sentiment analysis and keyword extraction on Craigslist listings Background I love Craigslist. I've found most of my

Mark Musil 1 Nov 08, 2021
Built for cleaning purposes in military institutions

Ferramenta do AL Construído para fins de limpeza em instituições militares. Instalação Requer python = 3.2 pip install -r requirements.txt Usagem Exe

0 Aug 13, 2022
Research code for "What to Pre-Train on? Efficient Intermediate Task Selection", EMNLP 2021

efficient-task-transfer This repository contains code for the experiments in our paper "What to Pre-Train on? Efficient Intermediate Task Selection".

AdapterHub 26 Dec 24, 2022
Code for the Python code smells video on the ArjanCodes channel.

7 Python code smells This repository contains the code for the Python code smells video on the ArjanCodes channel (watch the video here). The example

55 Dec 29, 2022
Text editor on python to convert english text to malayalam(Romanization/Transiteration).

Manglish Text Editor This is a simple transiteration (romanization ) program which is used to convert manglish to malayalam (converts njaan to ഞാൻ ).

Merin Rose Tom 1 May 11, 2022
FewCLUE: 为中文NLP定制的小样本学习测评基准

FewCLUE: 为中文NLP定制的小样本学习测评基准

CLUE benchmark 387 Jan 04, 2023
Under the hood working of transformers, fine-tuning GPT-3 models, DeBERTa, vision models, and the start of Metaverse, using a variety of NLP platforms: Hugging Face, OpenAI API, Trax, and AllenNLP

Transformers-for-NLP-2nd-Edition @copyright 2022, Packt Publishing, Denis Rothman Contact me for any question you have on LinkedIn Get the book on Ama

Denis Rothman 150 Dec 23, 2022
NLP: SLU tagging

NLP: SLU tagging

北海若 3 Jan 14, 2022
Extract rooms type, door, neibour rooms, rooms corners nad bounding boxes, and generate graph from rplan dataset

Housegan-data-reader House-GAN++ (data-reader) Code and instructions for converting rplan dataset (raster images) to housegan++ data format. House-GAN

Sepid Hosseini 13 Nov 24, 2022
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

Yi-Chang Chen 5 Dec 15, 2022
Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

Knowledge Graph,Question Answering System,基于知识图谱和向量检索的医疗诊断问答系统

wangle 823 Dec 28, 2022
SpikeX - SpaCy Pipes for Knowledge Extraction

SpikeX is a collection of pipes ready to be plugged in a spaCy pipeline. It aims to help in building knowledge extraction tools with almost-zero effort.

Erre Quadro Srl 384 Dec 12, 2022
translate using your voice

speech-to-text-translator Usage translate using your voice description this project makes translating a word easy, all you have to do is speak and...

1 Oct 18, 2021
Train 🤗transformers with DeepSpeed: ZeRO-2, ZeRO-3

Fork from https://github.com/huggingface/transformers/tree/86d5fb0b360e68de46d40265e7c707fe68c8015b/examples/pytorch/language-modeling at 2021.05.17.

Junbum Lee 12 Oct 26, 2022